Re: zfs arc and amount of wired memory

2012-02-09 Thread Andriy Gapon
on 09/02/2012 06:27 Eugene M. Zheganin said the following:
> The output I promised (if it's MORE acceptable in the form of a link to a
> paste site, just say it):

I prefer links, but both ways are acceptable to me.
Just one more hint on the reporting.  The most useful reports are coherent
reports.  That is, I now have your older reports from top and zfs-stat and I
have newer vmstat reports.  But I do not have all the reports taken at about the
same time, so I don't have a coherent picture of a system state.

-- 
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: zfs arc and amount of wired memory

2012-02-09 Thread Andriy Gapon
on 09/02/2012 10:33 Andriy Gapon said the following:
> on 09/02/2012 06:27 Eugene M. Zheganin said the following:
>> The output I promised (if it's MORE acceptable in the form of a link to a
>> paste site, just say it):
> 
> I prefer links, but both ways are acceptable to me.
> Just one more hint on the reporting.  The most useful reports are coherent
> reports.  That is, I now have your older reports from top and zfs-stat and I
> have newer vmstat reports.  But I do not have all the reports taken at about
> the same time, so I don't have a coherent picture of a system state.
> 

And please take the reports after the discrepancy between ARC size and wired
size is large enough, e.g. 1GB.  That's when they are useful.

-- 
Andriy Gapon
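
A rough way to put a number on that discrepancy, as a sketch only: the
functions below take the wired page count, the page size and the ARC size,
print the gap in megabytes, and check it against a threshold.  The sysctl
names in the comment are an assumption about where these values come from on
FreeBSD and are not taken from this thread; the 1024 MB default simply mirrors
the "e.g. 1GB" above.

```shell
#!/bin/sh
# wired_arc_gap_mb WIRED_PAGES PAGE_SIZE ARC_BYTES
# Prints (wired memory - ARC size) in whole megabytes.
wired_arc_gap_mb() {
    echo $(( ($1 * $2 - $3) / 1048576 ))
}

# gap_is_reportable GAP_MB [THRESHOLD_MB]
# Succeeds when the gap has reached the threshold (default 1024 MB).
gap_is_reportable() {
    [ "$1" -ge "${2:-1024}" ]
}

# Assumed FreeBSD usage (sysctl names are an assumption, not from the thread):
#   gap=$(wired_arc_gap_mb "$(sysctl -n vm.stats.vm.v_wire_count)" \
#                          "$(sysctl -n hw.pagesize)" \
#                          "$(sysctl -n kstat.zfs.misc.arcstats.size)")
#   gap_is_reportable "$gap" && echo "gap is ${gap} MB, take the reports now"
```

With 4 GB wired and a 1769 MB ARC, for example, the gap works out to 2327 MB,
well past the point where the reports become useful.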


Re: zfs arc and amount of wired memory

2012-02-09 Thread Eugene M. Zheganin

Hi.

On 09.02.2012 14:35, Andriy Gapon wrote:

> And please take the reports after the discrepancy between ARC size and wired
> size is large enough, e.g. 1GB.  That's when they are useful.

Okay, I wrote a short script that captures the output of top -b, zfs-stats -a,
vmstat -m and vmstat -z, in that order, into a timestamped file, and put it in
a crontab to run every hour.
I will provide the files it creates (or a subset, if there are too many) after
the system enters a deadlock again.

That usually takes one to two weeks.

Thanks.
Eugene.
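
The script itself was not posted.  A minimal sketch of what such an hourly
capture could look like follows; the output directory, the file naming and the
"run" argument are hypothetical, and only the four commands come from the
message above.

```shell
#!/bin/sh
# Hypothetical output directory; the real script's location is not given.
OUTDIR="${OUTDIR:-/var/tmp/memreports}"

# report_name DIR: print a timestamped file name inside DIR.
report_name() {
    printf '%s/memreport-%s.txt\n' "$1" "$(date +%Y%m%d-%H%M%S)"
}

# capture FILE CMD...: append each command's output to FILE under a header.
capture() {
    out="$1"; shift
    for cmd in "$@"; do
        printf '===== %s =====\n' "$cmd" >> "$out"
        # Unquoted on purpose so "vmstat -m" splits into command and flag.
        $cmd >> "$out" 2>&1
    done
}

# Run only when called as "script run", e.g. from an hourly crontab entry:
#   0 * * * * /path/to/memreport.sh run
if [ "${1:-}" = "run" ]; then
    mkdir -p "$OUTDIR"
    capture "$(report_name "$OUTDIR")" \
        "top -b" "zfs-stats -a" "vmstat -m" "vmstat -z"
fi
```

Writing all four outputs into one file per run keeps each report coherent,
i.e. everything in a file was taken at about the same time.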


Re: zfs arc and amount of wired memory

2012-02-09 Thread Eugene M. Zheganin

Hi.

On 09.02.2012 14:35, Andriy Gapon wrote:

> And please take the reports after the discrepancy between ARC size and wired
> size is large enough, e.g. 1GB.  That's when they are useful.

One more thing: this machine is running a debug/ddb kernel.  So, to avoid
losing another two weeks: when/if it enters a deadlock, do you (or anyone
else) need a crashdump, or anything else I can provide using ddb while it is
deadlocked?


Thanks.
Eugene.


Re: zfs arc and amount of wired memory

2012-02-09 Thread Alexander Leidinger
Hi,

if you are not using USB3 and a fast memory stick, it will be slower than 
swapping to disk.

Bye,
Alexander.

-- 
Send via an Android device, please forgive brevity and typographic and spelling 
errors. 

Freddie Cash wrote:
On Wed, Feb 8, 2012 at 10:25 AM, Eugene M. Zheganin wrote:
> On 08.02.2012 18:15, Alexander Leidinger wrote:
>> I can't remember having seen any mention of SWAP on ZFS being safe
>> now. So if nobody can provide a reference to a place which tells that
>> the problems with SWAP on ZFS are fixed:
>>  1. do not use SWAP on ZFS
>>  2. see 1.
>>  3. check if you see the same problem without SWAP on ZFS (btw. see 1.)
>>
> So, if swap has to be used, and it has to be backed by something like
> gmirror so it won't go down with one of the disks, there's no need to
> use zfs for the system.
>
> This makes zfs only useful in cases where you need to store something on a
> couple+ of terabytes, still having OS on ufs. Occam's razor and so on.

Or, you plug a USB stick into the back (or even inside the case as a
lot of mobos have internal USB connectors now) and use that for swap.

-- 
Freddie Cash
fjwc...@gmail.com

Re: zfs arc and amount of wired memory

2012-02-09 Thread Alexander Leidinger
Hi,

this only applies to old systems (slooow disks, no NCQ support), or very fast 
USB3 memory sticks. Current (I would say at least 2-3 year old) hardware is 
slowed down by USB2.

Bye,
Alexander.

-- 
Send via an Android device, please forgive brevity and typographic and spelling 
errors. 

Freddie Cash wrote:
On Wed, Feb 8, 2012 at 10:40 AM, Freddie Cash wrote:
> On Wed, Feb 8, 2012 at 10:25 AM, Eugene M. Zheganin wrote:
>> On 08.02.2012 18:15, Alexander Leidinger wrote:
>>> I can't remember having seen any mention of SWAP on ZFS being safe
>>> now. So if nobody can provide a reference to a place which tells that
>>> the problems with SWAP on ZFS are fixed:
>>>  1. do not use SWAP on ZFS
>>>  2. see 1.
>>>  3. check if you see the same problem without SWAP on ZFS (btw. see 1.)
>>>
>> So, if swap has to be used, and it has to be backed by something like
>> gmirror so it won't go down with one of the disks, there's no need to
>> use zfs for the system.
>>
>> This makes zfs only useful in cases where you need to store something on a
>> couple+ of terabytes, still having OS on ufs. Occam's razor and so on.
>
> Or, you plug a USB stick into the back (or even inside the case as a
> lot of mobos have internal USB connectors now) and use that for swap.

That also works well for adding L2ARC (cache) to the ZFS pool.

-- 
Freddie Cash
fjwc...@gmail.com

Re: zfs arc and amount of wired memory

2012-02-09 Thread Alexander Leidinger

Hi,

a possible solution would be to start a wiki page with what you know, e.g. a
page which explains that solaris and zio* belong to ZFS. Over time people can
extend it with additional info.

Bye,
Alexander.

-- 
Send via an Android device, please forgive brevity and typographic and spelling 
errors. 

Jeremy Chadwick wrote:
On Wed, Feb 08, 2012 at 10:29:36PM +0200, Andriy Gapon wrote:
> on 08/02/2012 12:31 Eugene M. Zheganin said the following:
> > Hi.
> > 
> > On 08.02.2012 02:17, Andriy Gapon wrote:
> >> [output snipped]
> >>
> >> Thank you.  I don't see anything suspicious/unusual there.
> >> Just in case, do you have ZFS dedup enabled by any chance?
> >>
> >> I think that examination of vmstat -m and vmstat -z outputs may provide
> >> some clues as to what got all that memory wired.
> >>
> > Nope, I don't have deduplication feature enabled.
> 
> OK.  So, did you have a chance to inspect vmstat -m and vmstat -z?

Andriy,

Politely -- recommending this to a user is a good choice of action, but
the problem is that no user, even an experienced user, is going to know
what all of the "Types" (vmstat -m) or "ITEMs" (vmstat -z) correlate
with on the system.

For example, for vmstat -m, the Type name is "solaris".  For vmstat -z,
the ITEMs are named zio_* but I have a feeling there are more than just
those which pertain to ZFS.  I'm having to make *assumptions*.

The FreeBSD VM is highly complex and is not "easy to understand" even
remotely.  It becomes more complex when you consider that we use terms
like "wired", "active", "inactive", "cache", and "free" -- and none of
them, in simple English terms, actually represent the words chosen for
what they do.

Furthermore, the only definition I've been able to find over the years
for how any of these work, what they do/mean, etc. is here:

http://www.freebsd.org/doc/en/books/arch-handbook/vm.html

And this piece of documentation is only useful for people who understand
VMs (note: it was written by Matt Dillon, for example).  It is not
useful for end-users trying to track down what within the kernel is
actually eating up memory.  "vmstat -m" is as good as it's going to get,
and like I said, with the Type names being borderline ambiguous
(depending on what you're looking for -- with VFS and so on it's spread
all over the place), this becomes a very tedious task, where the user or
admin have to continually ask developers on the mailing lists what it is
they're looking at.

-- 
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: zfs arc and amount of wired memory

2012-02-09 Thread Alexander Leidinger
Hi,

feel free to register with FirstnameLastname in the wiki and tell us about it.
We provide write access to people who seriously want to help improve the wiki
content.

Bye,
Alexander.

-- 
Send via an Android device, please forgive brevity and typographic and spelling 
errors. 

Charles Sprickman wrote:
On Feb 8, 2012, at 7:43 PM, Artem Belevich wrote:

> On Wed, Feb 8, 2012 at 4:28 PM, Jeremy Chadwick
>  wrote:
>> On Thu, Feb 09, 2012 at 01:11:36AM +0100, Miroslav Lachman wrote:
> ...
>>> ARC Size:
>>>  Current Size: 1769 MB (arcsize)
>>>  Target Size (Adaptive):   512 MB (c)
>>>  Min Size (Hard Limit):    512 MB (zfs_arc_min)
>>>  Max Size (Hard Limit):    3584 MB (zfs_arc_max)
>>> 
>>> The target size is going down to the min size and after a few more
>>> days, the system is so slow that I must reboot the machine. Then it
>>> is running fine for about 107 days and then it all repeats again.
>>> 
>>> You can see more on MRTG graphs
>>> http://freebsd.quip.cz/ext/2012/2012-02-08-kiwi-mrtg-12-15/
>>> You can see links to other useful information at the top of the page
>>> (arc_summary, top, dmesg, fs usage, loader.conf)
>>> 
>>> There you can see nightly backups (higher CPU load started at
>>> 01:13), otherwise the machine is idle.
>>> 
>>> It corresponds with ARC target size lowering in the last 5 days
>>> http://freebsd.quip.cz/ext/2012/2012-02-08-kiwi-mrtg-12-15/local_zfs_arcstats_size.html
>>> 
>>> And with ARC metadata cache overflowing the limit in last 5 days
>>> http://freebsd.quip.cz/ext/2012/2012-02-08-kiwi-mrtg-12-15/local_zfs_vfs_meta.html
>>> 
>>> I don't know what's going on and I don't know if it is something
>>> known / fixed in newer releases. We are running a few more ZFS
>>> systems on 8.2 without this issue. But those systems are in
>>> different roles.
>> 
>> This sounds like the... damn, what is it called... some kind of internal
>> "counter" or "ticks" thing within the ZFS code that was discovered to
>> only begin happening after a certain period of time (which correlated to
>> some number of days, possibly 107).  I'm sorry that I can't be more
>> specific, but it's been discussed heavily on the lists in the past, and
>> fixes for all of that were committed to RELENG_8.  I wish I could
>> remember the name of the function or macro or variable name it pertained
>> to, something like LTHAW or TLOCK or something like that.  I would say
>> "I don't know why I can't remember", but I do know why I can't remember:
>> because I gave up trying to track all of these problems.
>> 
>> Does someone else remember this issue?  CC'ing Martin who might remember
>> for certain.
> 
> It's LBOLT. :-)
> 
> And there was more than one related integer overflow. One of them
> manifested itself as L2ARC feeding thread hogging CPU time after about
> a month of uptime. Another one caused issue with ARC reclaim after 107
> days. See more details in this thread:
> 
> http://lists.freebsd.org/pipermail/freebsd-fs/2011-May/011584.html

This would be an excellent piece of information to have on one of the ZFS
wiki pages.  The 107 day issue exists post-8.2, correct?  Anyone on this 
cc: list have permissions to edit those pages?
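
For readers wondering where figures like "107 days" and "about a month" can
come from, the arithmetic below is a sketch under two assumptions that are not
stated in this thread: that the tick rate is 1000 Hz, and that the overflowing
quantities are a signed 64-bit product of nanoseconds and the tick rate in one
case and a signed 32-bit tick counter in the other.  The thread linked above
is the authoritative description of the actual bugs.

```shell
#!/bin/sh
# fmt_uptime SECONDS: print SECONDS as whole days and remaining hours.
fmt_uptime() {
    echo "$(( $1 / 86400 ))d $(( ($1 % 86400) / 3600 ))h"
}

hz=1000   # assumed tick rate; not stated in the thread

# Assumption: (nanoseconds * hz) kept in a signed 64-bit integer.
fmt_uptime $(( 9223372036854775807 / $hz / 1000000000 ))   # prints 106d 18h

# Assumption: a tick counter kept in a signed 32-bit integer.
fmt_uptime $(( 2147483647 / $hz ))                          # prints 24d 20h
```

Both results line up with the uptimes reported above, which is why the tick
rate matters when estimating when a machine will hit the problem.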

Thanks,

Charles

> 
> --Artem

Re: siisch1: Error while READ LOG EXT

2012-02-09 Thread Mike Tancsa
On 2/8/2012 5:46 PM, Alexander Motin wrote:
> 
> READ LOG EXT for NCQ, same as REQUEST SENSE for ATAPI sent by every
> specific controller driver. In this case by siis_issue_recovery()
> function in dev/siis/siis.c. In case of proper READ LOG EXT completion,
> fetched status returned to CAM together with original command.

Hi,
Is there a way to find out which drive is causing these errors?
Looking at the logs on the various drives, they all seem to have the odd
non-zero value.  I suspect it might be a Seagate disk, as smartctl flags
it as having bad firmware issues:


=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11
Device Model:     ST31000333AS
Serial Number:    9TE14SRV
LU WWN Device Id: 5 000c50 010a39664
Firmware Version: SD35
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Feb  9 09:40:56 2012 EST

==> WARNING: There are known problems with these drives,
see the following Seagate web pages:
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207957

> 


-- 
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/


Re: siisch1: Error while READ LOG EXT

2012-02-09 Thread Jeremy Chadwick
On Thu, Feb 09, 2012 at 09:43:01AM -0500, Mike Tancsa wrote:
> On 2/8/2012 5:46 PM, Alexander Motin wrote:
> > 
> > READ LOG EXT for NCQ, same as REQUEST SENSE for ATAPI sent by every
> > specific controller driver. In this case by siis_issue_recovery()
> > function in dev/siis/siis.c. In case of proper READ LOG EXT completion,
> > fetched status returned to CAM together with original command.
> 
> Hi,
>   Is there a way to find out which drive is causing these errors ?
> Looking at the logs on the various drives, they all seem to have the odd
> non-zero value.  I suspect it might be a Seagate disk, as smartctl flags
> it as having bad firmware issues:
> 
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda 7200.11
> Device Model:     ST31000333AS
> Serial Number:    9TE14SRV
> LU WWN Device Id: 5 000c50 010a39664
> Firmware Version: SD35
> User Capacity:    1,000,204,886,016 bytes [1.00 TB]
> Sector Size:      512 bytes logical/physical
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 4
> Local Time is:    Thu Feb  9 09:40:56 2012 EST
> 
> ==> WARNING: There are known problems with these drives,
> see the following Seagate web pages:
> http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
> http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951
> http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207957

The URLs listed are for firmware-level problems with this model of
Seagate drive.  This is a very famous firmware issue and got a lot of
media attention.  The bugs with that firmware, however, would not appear
as what you are seeing.

You stated in your original mail that you "added a port multiplier" then
started getting these errors.  You then provided SMART output of
/dev/ada9, so I made the assumption you had managed to figure out what
device was causing the problem.

I have to assume that devices connected on a port multiplier show up on
a separate scbusX number.  This is from your original mail:

> # camcontrol devlist
> at scbus0 target 0 lun 0 (pass0,ada0)
> at scbus0 target 1 lun 0 (pass1,ada1)
> at scbus0 target 2 lun 0 (pass2,ada2)
> at scbus0 target 3 lun 0 (pass3,ada3)
> at scbus0 target 15 lun 0 (pass4,pmp1)
> at scbus1 target 0 lun 0 (pass5,ada4)
> at scbus1 target 1 lun 0 (pass6,ada5)
> at scbus1 target 2 lun 0 (pass7,ada6)
> at scbus1 target 3 lun 0 (pass8,ada7)
> at scbus1 target 4 lun 0 (pass9,ada8)
> at scbus1 target 15 lun 0 (pass10,pmp0)
> at scbus4 target 0 lun 0 (pass11,da0)
> at scbus4 target 0 lun 1 (pass12,da1)
> at scbus4 target 16 lun 0 (pass13)
> at scbus5 target 0 lun 0 (pass14,da2)
> at scbus6 target 0 lun 0 (pass15,ada9)
> at scbus7 target 0 lun 0 (pass16,ada10)
> at scbus8 target 0 lun 0 (pass17,ada11)
> at scbus11 target 0 lun 0 (pass18,ada12)

Based on this, and assuming my understanding of how this setup works --
and please note I could be wrong, these port multiplier things I have no
familiarity with personally -- but it looks (to me) like this:

scbus0
  --> Associated with Port Multiplier device pmp1
  --> Disk ada0
  --> Disk ada1
  --> Disk ada2
  --> Disk ada3

scbus1
  --> Associated with Port Multiplier device pmp0
  --> Disk ada4
  --> Disk ada5
  --> Disk ada6
  --> Disk ada7
  --> Disk ada8

scbus4
  --> Appears to be an Areca controller of some kind, in RAID
  --> Disk da0, volume "usrvar" 
  --> Disk da1, volume "backup1"

scbus5
  --> Not sure what this thing is
  --> Disk or "thing" da2

scbus6
  --> Disk ada9

scbus7
  --> Disk ada10

scbus8
  --> Disk ada11

scbus11
  --> Disk ada12

So which Port Multiplier did you add?  The one at scbus0 or scbus1?

A full dmesg (not just a snippet) would probably be helpful here.  What
you provided in your first post was too terse, especially given how many
disks you have in this system.  :-)

I really see no problem with looking at all disks -- specifically disks
ada0 through ada3, and ada4 through ada8 -- to determine which one may
be having problems.  You're welcome to run "smartctl -a" on each one and
put them up on the web, preferably segregated by disk name (e.g.
ada0.txt, ada1.txt, etc.) and I can review them all.

-- 
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |



Re: siisch1: Error while READ LOG EXT

2012-02-09 Thread Mike Tancsa
On 2/9/2012 10:22 AM, Jeremy Chadwick wrote:
> 
> I have to assume that devices connected on a port multiplier show up on
> a separate scbusX number.  This is from your original mail:

> Based on this, and assuming my understanding of how this setup works --
> and please note I could be wrong, these port multiplier things I have no
> familiarity with personally -- but it looks (to me) like this:
> 
> scbus0
>   --> Associated with Port Multiplier device pmp1
>   --> Disk ada0
>   --> Disk ada1
>   --> Disk ada2
>   --> Disk ada3

Correct. This is the original hardware.  It too was showing the odd
error prior to adding the new set of disks to expand the zfs pool.  e.g.
here are some errors on the original PM

Feb  4 22:55:02 backup3 kernel: siisch0: Timeout on slot 24
Feb  4 22:55:02 backup3 kernel: siisch0: siis_timeout is 0004 ss 25002a00 rs 25002a00 es  sts 80182000 serr 
Feb  4 22:55:02 backup3 kernel: siisch0:  ... waiting for slots 24002a00
Feb  4 22:55:02 backup3 kernel: siisch0: Timeout on slot 13
Feb  4 22:55:02 backup3 kernel: siisch0: siis_timeout is 0004 ss 25002a00 rs 25002a00 es  sts 80182000 serr 
Feb  4 22:55:02 backup3 kernel: siisch0:  ... waiting for slots 24000a00
Feb  4 22:55:02 backup3 kernel: siisch0: Timeout on slot 29
Feb  4 22:55:02 backup3 kernel: siisch0: siis_timeout is 0004 ss 25002a00 rs 25002a00 es  sts 80182000 serr 
Feb  4 22:55:02 backup3 kernel: siisch0:  ... waiting for slots 04000a00
Feb  4 22:55:02 backup3 kernel: siisch0: Timeout on slot 11


> 
> scbus1
>   --> Associated with Port Multiplier device pmp0
>   --> Disk ada4
>   --> Disk ada5
>   --> Disk ada6
>   --> Disk ada7
>   --> Disk ada8

Correct, this is the new PM. 4 disks in use, and one spare.

> 
> scbus4
>   --> Appears to be an Areca controller of some kind, in RAID

yes.

>   --> Disk da0, volume "usrvar" 
>   --> Disk da1, volume "backup1"
> 
> scbus5
>   --> Not sure what this thing is

3ware with a pair of faster disks that holds a large DB to slice and
dice netflow data.

>   --> Disk or "thing" da2
> 
> scbus6
> scbus7
> scbus8
> scbus11
>   --> Disk ada12

Disks off the motherboard.

> 
> So which Port Multiplier did you add?  The one at scbus0 or scbus1?

1
   at scbus1 target 0 lun 0 (pass5,ada4)
   at scbus1 target 1 lun 0 (pass6,ada5)
   at scbus1 target 2 lun 0 (pass7,ada6)
   at scbus1 target 3 lun 0 (pass8,ada7)
   at scbus1 target 4 lun 0 (pass9,ada8)
   at scbus1 target 15 lun 0 (pass10,pmp0)





> 
> A full dmesg (not just a snippet) would probably be helpful here.  What
> you provided in your first post was too terse, especially given how many
> disks you have in this system.  :-)
> 
> I really see no problem with looking at all disks -- specifically disks
> ada0 through ada3, and ada4 through ada8 -- to determine which one may
> be having problems.  You're welcome to run "smartctl -a" on each one and
> put them up on the web, preferably segregated by disk name (e.g.
> ada0.txt, ada1.txt, etc.) and I can review them all.

Actually, I just had a look at another server at our DR site. Its
hardware has not changed in a bit, but I did bring the kernel up to date.
It's now logging the odd 'READ LOG EXT' error as well.  Its kernel is
from Jan 22.  Prior to that kernel update, I had not seen these errors.
Has something in the driver (ahci or cam layer?) changed, perhaps?

Feb  4 11:12:36 offsite kernel: siisch1: Error while READ LOG EXT

The output is in one giant txt file, but each section is headed by the
name of the disk.  It was generated with:

# jot 10 0 prints 0 through 9, so this covers ada0 to ada9
for i in `jot 10 0`; do
    echo " ada$i ==" >> d.rep
    smartctl -x /dev/ada$i >> d.rep
    smartctl -l gplog,0x10 /dev/ada$i >> d.rep
done



http://www.tancsa.com/ahci.txt


---Mike







-- 
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/


Re: serious packet routing issue causing ntpd high load?

2012-02-09 Thread Qing Li
Hi Vlad,

Sorry about the delayed response. No, this one just fell through the cracks.

Has anyone responded ?  Does it still exist in 9.x ?

--Qing

On Mon, Feb 6, 2012 at 10:16 AM, Vlad Galu  wrote:
> Hi Qing,
>
> Any luck with this?
>
> Thanks
> Vlad
>
>
> On Thu, Nov 3, 2011 at 2:05 PM, Li, Qing  wrote:
>>
>> This endless route lookup miss message problem is reproducible without
>> FLOWTABLE.  The problem is with the multiple FIBs. I cannot reproduce
>> this problem in my home network but the problem is easily seen at work.
>>
>> The route lookup miss itself in multi-FIBs configuration may be normal
>> depending on the actual system configuration. It's the flooding of
>> RTM_MISS messages that is abnormal. For example, if the route to the
>> DNS servers is not configured in all FIBs, then the RTM_MISS
>> message will be generated when a userland application sends to an
>> explicit IP address in a specific FIB.
>>
>> In any case, I can reproduce the issue consistently and am just trying to
>> get a few uninterrupted hours to get it done.
>>
>> --Qing
>>


Re: siisch1: Error while READ LOG EXT

2012-02-09 Thread Gary Palmer
On Thu, Feb 09, 2012 at 07:22:40AM -0800, Jeremy Chadwick wrote:
> I have to assume that devices connected on a port multiplier show up on
> a separate scbusX number.  This is from your original mail:
> 
> > # camcontrol devlist
> > at scbus0 target 0 lun 0 (pass0,ada0)
> > at scbus0 target 1 lun 0 (pass1,ada1)
> > at scbus0 target 2 lun 0 (pass2,ada2)
> > at scbus0 target 3 lun 0 (pass3,ada3)
> > at scbus0 target 15 lun 0 (pass4,pmp1)
> > at scbus1 target 0 lun 0 (pass5,ada4)
> > at scbus1 target 1 lun 0 (pass6,ada5)
> > at scbus1 target 2 lun 0 (pass7,ada6)
> > at scbus1 target 3 lun 0 (pass8,ada7)
> > at scbus1 target 4 lun 0 (pass9,ada8)
> > at scbus1 target 15 lun 0 (pass10,pmp0)
> > at scbus4 target 0 lun 0 (pass11,da0)
> > at scbus4 target 0 lun 1 (pass12,da1)
> > at scbus4 target 16 lun 0 (pass13)
> > at scbus5 target 0 lun 0 (pass14,da2)
> > at scbus6 target 0 lun 0 (pass15,ada9)
> > at scbus7 target 0 lun 0 (pass16,ada10)
> > at scbus8 target 0 lun 0 (pass17,ada11)
> > at scbus11 target 0 lun 0 (pass18,ada12)
> 
> Based on this, and assuming my understanding of how this setup works --
> and please note I could be wrong, these port multiplier things I have no
> familiarity with personally -- but it looks (to me) like this:
> 
> scbus5
>   --> Not sure what this thing is
>   --> Disk or "thing" da2

3ware 9650SE controller (twa driver I believe)

Gary


Re: siisch1: Error while READ LOG EXT

2012-02-09 Thread Jeremy Chadwick
On Thu, Feb 09, 2012 at 11:12:06AM -0500, Mike Tancsa wrote:
> {snipping}
>
> > So which Port Multiplier did you add?  The one at scbus0 or scbus1?
> 
> 1
> at scbus1 target 0 lun 0 (pass5,ada4)
> at scbus1 target 1 lun 0 (pass6,ada5)
> at scbus1 target 2 lun 0 (pass7,ada6)
> at scbus1 target 3 lun 0 (pass8,ada7)
> at scbus1 target 4 lun 0 (pass9,ada8)
> at scbus1 target 15 lun 0 (pass10,pmp0)

I'll provide analysis for all 5 of these disks below.

> > A full dmesg (not just a snippet) would probably be helpful here.  What
> > you provided in your first post was too terse, especially given how many
> > disks you have in this system.  :-)
> > 
> > I really see no problem with looking at all disks -- specifically disks
> > ada0 through ada3, and ada4 through ada8 -- to determine which one may
> > be having problems.  You're welcome to run "smartctl -a" on each one and
> > put them up on the web, preferably segregated by disk name (e.g.
> > ada0.txt, ada1.txt, etc.) and I can review them all.
> 
> Actually, I just had a look at another server at our DR site. Its
> hardware has not changed in a bit, but I did bring the kernel up to date.
> It's now logging the odd 'READ LOG EXT' error as well.  Its kernel is
> from Jan 22.  Prior to that kernel update, I had not seen these errors.
> Has something in the driver (ahci or cam layer?) changed, perhaps?
> 
> Feb  4 11:12:36 offsite kernel: siisch1: Error while READ LOG EXT

Perhaps, but mav@ would be the authority on that.

> http://www.tancsa.com/ahci.txt

So here are the results of analysis for disks ada4 through ada8:

ada4
  --> When the errors below happened is 100% unknown.  Just noting
      that here.
  --> SMART attribute 199 shows 13 CRC errors.  These would be
      caused by issues between the disk and the device it's
      attached to (port multiplier I guess).  Causes could be
      bad SATA cables, bad ports, dirty/dusty ports, or flaky
      PCB (on the disk itself).
  --> SATA PHY log/counters confirm the above problem:
      ID      Size  Value  Description
      0x0001  2        13  Command failed due to ICRC error
      0x0002  2        13  R_ERR response for data FIS
      0x0003  2        13  R_ERR response for device-to-host data FIS
  --> Given this behaviour, possibly the ATA commands submitted which
      experienced errors were NCQ-related.
  --> The NCQ command error log does have non-zero values in it.
      The format of the output is proprietary, sadly, and smartmontools
      does not know how to decode it.  But, compare it to your other
      drives and you'll see there is non-zero data there.
  --> This is a likely candidate for the behaviour seen on this PM.

ada5
  --> When the errors below happened is 100% unknown.  Just noting
      that here.
  --> SMART attribute 199 shows 11 CRC errors.  These would be
      caused by issues between the disk and the device it's
      attached to (port multiplier I guess).  Causes could be
      bad SATA cables, bad ports, dirty/dusty ports, or flaky
      PCB (on the disk itself).
  --> SATA PHY log/counters confirm the above problem:
      ID      Size  Value  Description
      0x0001  2        11  Command failed due to ICRC error
      0x0002  2        11  R_ERR response for data FIS
      0x0003  2        11  R_ERR response for device-to-host data FIS
  --> Given this behaviour, possibly the ATA commands submitted which
      experienced errors were NCQ-related.
  --> The NCQ command error log does have non-zero values in it.
      The format of the output is proprietary, sadly, and smartmontools
      does not know how to decode it.  But, compare it to your other
      drives and you'll see there is non-zero data there.
  --> This is a likely candidate for the behaviour seen on this PM.

ada6
  --> When the errors below happened is 100% unknown.  Just noting
      that here.
  --> SMART attribute 199 shows 8 CRC errors.  These would be
      caused by issues between the disk and the device it's
      attached to (port multiplier I guess).  Causes could be
      bad SATA cables, bad ports, dirty/dusty ports, or flaky
      PCB (on the disk itself).
  --> SATA PHY log/counters confirm the above problem:
      ID      Size  Value  Description
      0x0001  2         8  Command failed due to ICRC error
      0x0002  2         8  R_ERR response for data FIS
      0x0003  2         8  R_ERR response for device-to-host data FIS
  --> Given this behaviour, possibly the ATA commands submitted which
      experienced errors were NCQ-related.
  --> The NCQ command error log does have non-zero values in it.
      The format of the output is proprietary, sadly, and smartmontools
      does not know how to decode it.  But, compare it to your other
      drives and you'll see there is non-zero data there.
  --> This is a likely candidate for the behaviour seen on this PM.

ada7
  --> When the errors below happened is 100% unknown.  Just noting
      that here.
 

Re: siisch1: Error while READ LOG EXT

2012-02-09 Thread Mike Tancsa
On 2/9/2012 11:34 AM, Jeremy Chadwick wrote:
> 
> You will probably need to "track these drives" on a regular basis.  That
> is to say, set up some cronjob or similar that logs the above output to
> a file (appends data to it), specifically output from smartctl -A (not
> -a and not -x) and smartctl -l sataphy on a per-disk basis.  smartd can
> track SMART attribute changes, but does not track GPLog changes.  Make
> sure to put timestamps in your logs.

Thanks very much for having a look, and for the suggestions. I think this
is the way to go to see which drive may have errors incrementing.
Alexander, is there a better way you can suggest?
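
A minimal sketch of the periodic logging described above, assuming a plain
sh script run from cron.  The log directory, the function names and the
SMARTCTL override variable are hypothetical and only there for
illustration; the smartctl options (-A and -l sataphy) are the ones named
in the quoted suggestion.

```shell
#!/bin/sh
# Sketch: append timestamped SMART attribute and SATA PHY counter output
# to one log file per disk.  LOGDIR and SMARTCTL are assumed names.

SMARTCTL="${SMARTCTL:-smartctl}"     # command to run; overridable
LOGDIR="${LOGDIR:-/var/log/smart}"   # hypothetical log location

# Append one timestamped snapshot for a single disk (e.g. "ada6").
log_disk() {
    disk="$1"
    log="${LOGDIR}/${disk}.log"
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ${disk} ==="
        "$SMARTCTL" -A "/dev/${disk}"
        "$SMARTCTL" -l sataphy "/dev/${disk}"
    } >> "$log" 2>&1
}

# Take a snapshot of every disk named on the command line.
log_all() {
    mkdir -p "$LOGDIR" || return 1
    for disk in "$@"; do
        log_disk "$disk"
    done
}

# Example invocation (left commented out on purpose):
# log_all ada5 ada6 ada7
```

Saved as a script with the last line uncommented, this could be run hourly
from root's crontab.  Comparing successive snapshots in a per-disk log
should then show which drive's CRC and R_ERR counters are still
incrementing.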

> 
> As for fixing the problem: I have no idea how you would go about this.
> Use of port multipliers involves additional cables, possibly of shoddy
> quality, or other components which may not be decent/reliable.  


Possibly.  Cables are one of those things I am happy to "pay extra for
better quality", but how does one assess the quality of such parts?

> 
> Overall, this is just one of the many reasons why I avoid PMs, as well
> as avoid eSATA (especially eSATA).  

Yeah, at some point it doesn't really work with too many PMs, especially
if you can't query the thing to find out where things are "bad".  I think
for the next version of this box I will use the newer generation 3ware
SAS/SATA controller.

---Mike



-- 
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/


known problems with 8.x and HP DL16 G5 server?

2012-02-09 Thread Julian Elischer

Does anyone know of problems with FreeBSD and this system?

The kernel we tried to boot seems to stop somewhere in the AHCI probing.



Re: known problems with 8.x and HP DL16 G5 server?

2012-02-09 Thread Jeremy Chadwick
On Thu, Feb 09, 2012 at 01:48:29PM -0800, Julian Elischer wrote:
> does anyone know of problems with freebsd and this system?
> 
> the kernel We tried to boot seems to stop somewhere in the ahci probing.

Few things:

1) Possible to get full console output (e.g. serial, etc.) from a verbose
boot?

2) Can you also provide the exact release/tag/kernel/thing you're trying
to install or upgrade to ("8.x" is a little vague; there are all sorts
of changes that happen between tags).  For example, 8.1 is not necessarily
going to behave the same as 8.2.

3) When you say "ahci probing", are you booting a standard installation
CD/DVD/memstick of, say, 8.2?  If so, those won't make use of the
AHCI-to-CAM translation layer (and that AHCI code is also different than
the native-ATA-AHCI code), so you might try, when booting the system,
dropping to the loader prompt and issuing "load ahci.ko" before typing
"boot".  See if that helps.  If it does, great, use it (ahci_load="yes"
in /boot/loader.conf) permanently (and benefit from things like NCQ
too).

4) If it's an Intel ESB2 controller, I believe there were some fixes or
identification shims put in place for this in recent RELENG_8, which
wouldn't be available in RELENG_8_2 or 8.2-RELEASE CD/DVDs.  I could be
remembering the wrong controller though.  Hmm...
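
For item 3, a minimal sketch of making the ahci.ko load permanent, assuming
it did help when loaded by hand at the loader prompt.  The function name
and the optional path argument are hypothetical; the setting itself is the
ahci_load line mentioned above.

```shell
#!/bin/sh
# Sketch: make sure an ahci_load line is present in loader.conf.
# enable_ahci_load is an illustrative name, not an existing tool.

enable_ahci_load() {
    conf="${1:-/boot/loader.conf}"
    # Leave the file alone if an ahci_load line is already there.
    if [ -f "$conf" ] && grep -q '^ahci_load=' "$conf"; then
        return 0
    fi
    echo 'ahci_load="YES"' >> "$conf"
}

# Example invocation (left commented out on purpose):
# enable_ahci_load /boot/loader.conf
```

The check before appending keeps repeated runs from adding duplicate
lines.  Note that with ahci(4) the disks attach through CAM as adaX rather
than adX, so /etc/fstab may need adjusting.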

-- 
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |



Re: serious packet routing issue causing ntpd high load?

2012-02-09 Thread Steven Hartland
- Original Message - 
From: "Qing Li" 

> Sorry about the delayed response. No, this one just fell through the cracks.
>
> Has anyone responded?  Does it still exist in 9.x?


We discovered yesterday that adding the following routes fixed the
issue.  They are present in /etc/rc.d/network_ipv6, but are not
active unless ipv6_enable="YES" is set:

route add -inet6 ::ffff:0.0.0.0 -prefixlen 96 ::1 -reject
route add -inet6 ::0.0.0.0 -prefixlen 96 ::1 -reject

I haven't confirmed it, but these routes are reported to be set
by default on 9.x due to the changes in the rc.d scripts.

   Regards
   Steve




Re: known problems with 8.x and HP DL16 G5 server?

2012-02-09 Thread Jeremy Chadwick
On Thu, Feb 09, 2012 at 04:02:12PM -0800, Julian Elischer wrote:
> On 2/9/12 1:56 PM, Jeremy Chadwick wrote:
> >On Thu, Feb 09, 2012 at 01:48:29PM -0800, Julian Elischer wrote:
> >>does anyone know of problems with freebsd and this system?
> >>
> >>the kernel We tried to boot seems to stop somewhere in the ahci probing.
> >Few things:
> >
> >1) Possible to get full console output (e.g. serial, etc.) from a verbose
> >boot?
> 
> It's FreeBSD 8.2 from a TrueNAS/FreeNAS. I'm actually at ix-systems at
> the moment, but I was hoping someone could save us some time by saying
> "Oh yeah, merge in change number xx".
> 
> >2) Can you also provide the exact release/tag/kernel/thing you're trying
> >to install or upgrade to ("8.x" is a little vague; there are all sorts
> >of changes that happen between tags).  For example 8.1 is not going to
> >behave the same necessarily as 8.2.
> >
> >3) When you say "ahci probing", are you booting a standard installation
> >CD/DVD/memstick of, say, 8.2?  If so, those won't make use of the
> >AHCI-to-CAM translation layer (and that AHCI code is also different than
> >the native-ATA-AHCI code), so you might try, when booting the system,
> >dropping to the loader prompt and issuing "load ahci.ko" before typing
> >"boot".  See if that helps.  If it does, great, use it (ahci_load="yes"
> >in /boot/loader.conf) permanently (and benefit from things like NCQ
> >too).
> 
> Let me forward you an image...
> 
> >4) If it's an Intel ESB2 controller, I believe there were some fixes or
> >identification shims put in place for this in recent RELENG_8, which
> >wouldn't be available in RELENG_8_2 or 8.2-RELEASE CD/DVDs.  I could be
> >remembering the wrong controller though.  Hmm...
> >
> 
> That may be what we are looking for.
> 
> I'll try to get more info.

For others: the last few lines in the kernel log are:

acpi_hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
acpi_hpet0: vend: 0x8086 rev: 0x1 num: 3 hz: 14318180 opts: legacy_route 64-bit
Timecounter "HPET" frequency 14318180 Hz quality 900
acpi: wakeup code va 0xff848311d000 pa 0x4000
ahc_isa_probe 0: ioport 0xc00 alloc failed

I don't see any indication of AHCI problems here (or AHCI at all).
ahc_isa_probe is for the ahc(4) controller -- Adaptec SCSI.

A verbose boot might be more helpful.

