Odd ANR2716E messages

2017-10-13 Thread Thomas Denier
One of our Windows client backups has had a minor but very puzzling problem on 
three of the last four days. On each of the three days the following sequence 
of events occurred:

1. The TSM server displayed the message "ANR2716E Schedule prompter was not
   able to contact client TJVDPMHD using type 1" three minutes and ten or
   eleven seconds after the nominal starting time for the backup.
2. The contact attempt was retried successfully thirty seconds after the
   error message.
3. The backup ran successfully and with no further sign of network
   communication issues.

The client system has a network interface dedicated to TSM traffic. This 
interface is on the same subnet as one of the network interfaces on the system 
hosting the TSM server. There are 24 other client systems on the subnet. None 
of the 24 have shown any recent signs of network communications issues.

A "query node" command reports that the client system is running 64-bit
Windows 7 and using TSM 6.2.4.0 client code.

The TSM server code is at level 6.3.5.0 and is running under zSeries Linux. I 
checked the various log files in /var/log and found no sign of network errors 
within the last few days.

Does anyone know of an explanation for the odd combination of consistent 
behavior on a 24 hour time scale and inconsistent behavior on a 30 second time 
scale?

Thomas Denier,
Thomas Jefferson University


Re: Magic Decoder Ring needed

2017-10-13 Thread Zoltan Forray
Update: This error is now occurring once or twice a day, usually while a
"backup stgpool" of our primary disk pool is running.

There is nothing in any of our hardware/os logs, including the PERC
controller logs. There is a Dell PERC firmware upgrade pending that is
labeled "Urgent" that we will pursue.

If this is another one of our "bad spots" in one of our disk volumes, can
someone from IBM help decode the error to perhaps point to what stgpool
volume has the "problem"?  We ran an audit on one of the 20+ volumes in
this stgpool but nothing showed up as "bad".  With over 30TB to run audits
on (and of course they are always busy), it will take a while.  The latest
message:

10/13/2017 11:55:53 AM ANR1330E The server has detected possible corruption
in an object that is being restored or moved. The actual values for the
incorrect frame are: magic 20890B50 hdr version 25350 hdr length  2320
sequence number 2114564210 data length D07D0F20 server ID 175174927 segment
ID 9270951345039929524 crc  4C5AC0C.
10/13/2017 11:55:53 AM ANR1331E Invalid frame detected.  Expected magic
53454652 sequence number   71 server id0 segment id
2720204019.
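
As a rough reading aid for the two magic values above (not an IBM-documented decoding procedure), the following Python sketch renders each hex value as ASCII. The helper name decode_magic is hypothetical. The expected magic 53454652 spells four printable letters, while the reported value 20890B50 does not, which matches the message's claim that the frame header is not what the server expected. This does not identify the affected stgpool volume.

```python
def decode_magic(hex_magic: str) -> str:
    """Render a hex magic value as ASCII, using '.' for non-printable bytes."""
    # Pad to 8 hex digits so values printed without a leading zero still parse.
    raw = bytes.fromhex(hex_magic.strip().zfill(8))
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Values copied from the ANR1331E (expected) and ANR1330E (actual) messages.
expected = decode_magic("53454652")
actual = decode_magic("20890B50")
print(f"expected magic: {expected!r}")
print(f"actual magic:   {actual!r}")
```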


On Wed, Oct 11, 2017 at 9:33 AM, Skylar Thompson wrote:

> I'm not aware of a fix for the problem (it's with Dell PERC H810s) but the
> problem manifested itself in lots and lots of media errors on a physical
> device, visible when you export the controller log. The symptoms for TSM
> included both CRC errors in the pool and also sporadically awful I/O
> throughput.
>
> The controller logs identified the slot with the media errors, and
> replacing the drive made all the above problems go away. Of course the real
> solution is going to be retiring these soon-to-be-EOSL'd devices, and I've
> finally got a budget to do it...
>
> I'm not actually aware of a fix for the problem, though I didn't spend a
> lot of time looking for one given that we'll be getting rid of the
> equipment in a few weeks. It could very well be an interaction between the
> RAID HBA and physical disk firmware. Unfortunately the system has a mix of
> disk vendors since Dell isn't consistent about which vendor they ship for
> replacements, but the drive I identified was a Fujitsu MBD2300RC.
>
> On Tue, Oct 10, 2017 at 02:18:01PM -0400, Zoltan Forray wrote:
> > Thank you for the info.  We have started running AUDIT's but with 30TB+ in
> > this disk stgpool, it will take a while.  I am very interested in
> > additional details on the RAID firmware issue you mentioned - any
> > specifics would be very helpful.  AFAIK, we are up-to-date on all Dell
> > firmware (we patch fairly regularly).
> >
> > Within the past 9-months, this server has had 3-diskpool volumes (all part
> > of RAID-5 arrays) suddenly become "bad", requiring full restores, with no
> > explanation since there was no sign of hardware problems. While I did open
> > a PMR with IBM, by the time they looked at my last failure, they said
> > there was nothing they could do to analyze the problem and to call them
> > back the next time it happens.
> >
> > On Tue, Oct 10, 2017 at 2:04 PM, Skylar Thompson <skyl...@u.washington.edu>
> > wrote:
> >
> > > Hi Zoltan,
> > >
> > > We ran into this recently, and it was caused by a firmware bug in a RAID
> > > adapter that caused it not to fail an obviously-failing disk in our disk
> > > spool. We followed the procedure here:
> > >
> > > https://www.ibm.com/support/knowledgecenter/en/SSGSG7_7.1.6/tshoot/r_pdg_1330_1331_msg.html
> > >
> >