Re: impossible packet length ...

2009-02-08 Thread Danny Braniss
I'm reposting this to hackers, and there is some more info.

> Hi,
> on 2 different servers, running 7.1-stable + zfs, I get this
> error rather frequently:
> 
> Feb  5 17:01:03 warhol-00 kernel: impossible packet length (543383918) from 
> nfs server sunfire:/dist
> Feb  5 17:01:03 warhol-00 kernel: impossible packet length (1936028704) from 
> nfs server sunfire:/dist
> Feb  5 17:01:03 warhol-00 kernel: impossible packet length (1869363744) from 
> nfs server sunfire:/dist
> Feb  5 17:01:03 warhol-00 kernel: impossible packet length (1667787057) from 
> nfs server sunfire:/dist
> Feb  5 17:01:03 warhol-00 kernel: impossible packet length (976040755) from 
> nfs server sunfire:/dist
> Feb  5 17:01:03 warhol-00 kernel: impossible packet length (1953459488) from 
> nfs server sunfire:/dist
> Feb  5 17:01:03 warhol-00 kernel: impossible packet length (1348825156) from 
> nfs server sunfire:/dist
> Feb  5 17:01:03 warhol-00 kernel: impossible packet length (0) from 
> nfs server sunfire:/dist
> Feb  5 17:01:03 warhol-00 kernel: impossible packet length (1647208041) from 
> nfs server sunfire:/dist
> 
> in this case the server is running FreeBSD-7.0-stable, but I also get it
> when the server is a NetApp.
> 
> is there a connection?
> 
> thanks,
>   danny

going through the logs, after it happened again, I got a glimpse of this:

Feb  6 18:00:13 warhol-00.cs.huji.ac.il kernel: bce0: discard frame w/o 
leading ethernet header (len 0 pkt len 0)
Feb  6 18:00:19 klee-05.cs.huji.ac.il kernel: nfs: server warhol-00 not 
responding, timed out
...
Feb  6 19:00:00 warhol-00.cs.huji.ac.il amd[715]: More than a single value for 
/defaults in hesiod.local
Feb  6 19:00:00 warhol-00.cs.huji.ac.il amd[715]: Unknown $ sequence in 
"rhost:=${RHOST};type:=nfsl;fs:=${FS};rfs:=$huldig#^ZM-^KoM- abase"
Feb  6 19:00:00 warhol-00.cs.huji.ac.il kernel: impossible packet length 
(2068989523) from nfs server sunfire:/dist

which seems to point fingers at bce...

danny
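
One thing that can be checked directly from the numbers above: NFS over TCP uses ONC RPC record marking, where each record starts with a 4-byte big-endian header carrying the fragment length. Assuming the mount uses TCP and the logged value is that header (possibly with its top flag bit masked off), reading the values back as four raw bytes shows that most of them are printable ASCII. That would be consistent with the client picking up payload bytes where it expected a length header. The sketch below is only a minimal Python illustration under that assumption; the helper names are made up for this example, and it says nothing about where the bytes went astray.

```python
import struct

# Values copied from the "impossible packet length" log lines in this thread.
LOGGED_LENGTHS = [
    543383918, 1936028704, 1869363744, 1667787057, 976040755,
    1953459488, 1348825156, 0, 1647208041, 2068989523,
]


def length_as_bytes(value: int) -> bytes:
    """Return the 4 big-endian bytes that would yield this length field."""
    return struct.pack(">I", value)


def printable(raw: bytes) -> str:
    """Render bytes as text, replacing non-printable ASCII with '.'."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    for value in LOGGED_LENGTHS:
        raw = length_as_bytes(value)
        print(f"{value:>10}  0x{value:08x}  {printable(raw)!r}")
```

For instance, 1936028704 unpacks to the bytes "set " and 2068989523 to "{RFS", which looks a lot like a fragment of the same amd map text that shows up garbled in the amd log line.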



___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: impossible packet length ...

2009-02-08 Thread Danny Braniss
> On 2009-Feb-08 10:45:13 +0200, Danny Braniss  wrote:
> >Feb  6 18:00:13 warhol-00.cs.huji.ac.il kernel: bce0: discard frame w/o 
> >leading ethernet header (len 0 pkt len 0)
> ...
> >Feb  6 19:00:00 warhol-00.cs.huji.ac.il amd[715]: Unknown $ sequence in 
> >"rhost:=${RHOST};type:=nfsl;fs:=${FS};rfs:=$huldig#^ZM-^KoM- abase"
> >Feb  6 19:00:00 warhol-00.cs.huji.ac.il kernel: impossible packet length 
> >(2068989523) from nfs server sunfire:/dist
> >
> >which seems to point fingers at bce...
> 
> It does rather suggest that bce is not behaving.  What happens if you
> turn off checksum off-loading?  This should make the kernel drop the
> corrupt packets instead of trying to process them.  If practical, you
> could also try (temporarily) plugging in a different NIC.
> 
I have, and now it's a matter of waiting...
Q: with rxcsum on, if a packet with a bad checksum is received, is it
   dropped by the NIC? If not, that would somewhat explain the behaviour.

Changing the NIC is tough, but it will be done if needed.
danny

> Peter Jeremy



Re: impossible packet length ...

2009-02-08 Thread Peter Jeremy
On 2009-Feb-08 11:31:45 +0200, Danny Braniss  wrote:
>Q: with rxcsum on, and a bad checksum packet is received, is it
>   dropped by the NIC? if not, then it somewhat explains the behaviour

If checksum offloading is working correctly then a bad packet should
be dropped by the NIC.  If checksum offloading isn't working correctly
then you can wind up in the situation where both the NIC and the
driver think the other party has verified the checksum.  It's also
possible that you may be running into corruption during DMA transfer
from the NIC to RAM.  ISTR there have been some issues reported
recently with checksum offloading on some NICs - though I don't have
details to hand - you might like to search the lists.

>changing the nic is tough, but if needed will be done. 

If disabling checksum offloading fixes the problem and the additional
CPU load is acceptable (at least until you find a real fix) then
there's no need to change NICs.

-- 
Peter Jeremy




Re: impossible packet length ...

2009-02-08 Thread Kostik Belousov
On Sun, Feb 08, 2009 at 10:45:13AM +0200, Danny Braniss wrote:
> I'm reposting this to hackers, and there is some more info.
> 
> > Hi,
> > on 2 different servers, running 7.1-stable + zfs, I get this
> > error rather frequently:
> > 
> > Feb  5 17:01:03 warhol-00 kernel: impossible packet length (543383918) from 
> > nfs server sunfire:/dist
> > Feb  5 17:01:03 warhol-00 kernel: impossible packet length (1936028704) from 
> > nfs server sunfire:/dist
> > Feb  5 17:01:03 warhol-00 kernel: impossible packet length (1869363744) from 
> > nfs server sunfire:/dist
> > Feb  5 17:01:03 warhol-00 kernel: impossible packet length (1667787057) from 
> > nfs server sunfire:/dist
> > Feb  5 17:01:03 warhol-00 kernel: impossible packet length (976040755) from 
> > nfs server sunfire:/dist
> > Feb  5 17:01:03 warhol-00 kernel: impossible packet length (1953459488) from 
> > nfs server sunfire:/dist
> > Feb  5 17:01:03 warhol-00 kernel: impossible packet length (1348825156) from 
> > nfs server sunfire:/dist
> > Feb  5 17:01:03 warhol-00 kernel: impossible packet length (0) from 
> > nfs server sunfire:/dist
> > Feb  5 17:01:03 warhol-00 kernel: impossible packet length (1647208041) from 
> > nfs server sunfire:/dist
> > 
> > in this case the server is running Freebsd-7.0-stable, but I also get it 
> > when the server is a netapp.
> > 
> > is there a connection?
> > 
> > thanks,
> > danny
> 
> going through the logs, after it happened again, I got a glimpse of this:
> 
> Feb  6 18:00:13 warhol-00.cs.huji.ac.il kernel: bce0: discard frame w/o 
> leading ethernet header (len 0 pkt len 0)
> Feb  6 18:00:19 klee-05.cs.huji.ac.il kernel: nfs: server warhol-00 not 
> responding, timed out
> ...
> Feb  6 19:00:00 warhol-00.cs.huji.ac.il amd[715]: More than a single value 
> for /defaults in hesiod.local
> Feb  6 19:00:00 warhol-00.cs.huji.ac.il amd[715]: Unknown $ sequence in 
> "rhost:=${RHOST};type:=nfsl;fs:=${FS};rfs:=$huldig#^ZM-^KoM- abase"
> Feb  6 19:00:00 warhol-00.cs.huji.ac.il kernel: impossible packet length 
> (2068989523) from nfs server sunfire:/dist
> 
> which seems to point fingers at bce...

bce(4) is broken in stable; your best option is to revert to the
driver in releng 7.1.




Possible VFS KPI and KBI breakage on stable/7

2009-02-08 Thread Kostik Belousov
There are three sets of changes that would benefit stable/7.
Namely, there are

1. Improvements to UFS unmount and rw->ro remount, which now perform
   suspension during the operation.

   The changes depend on the suspension mechanism patch, which
   introduced the suspension owner and added a new VFS op
   to the mount method table.

   This might also fix the hangs with gjournal or gjournal together
   with snapshots experienced by some users.

   Since the only real consumer of the suspension is UFS, I believe that MFC
   would have quite low impact, if any.

   Corresponding revision is 183073.

2. The openat(2) and similar syscalls. The new ZFS requires openat()
   functionality. We have to change struct nameidata to merge NDINIT_ATVP().

   All modules using namei() need to be recompiled.

3. Marcus' work on vn_fullpath() support for synthetic filesystems
   introduces a new VOP, vop_vptocnp.

   This would allow procstat(1) to work on devfs and pseudofs vnodes.
   As I understand, this would also improve Gnome experience on FreeBSD.

   All fs modules need to be recompiled.
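
For readers unfamiliar with item 2: openat(2) opens a path relative to an already-open directory descriptor rather than the process's current directory. Below is a minimal userland sketch in Python, whose os.open() accepts a dir_fd argument on platforms that support it. The file name and helper names are invented for the illustration, and it has nothing to do with the kernel-side nameidata change itself.

```python
import os
import tempfile


def read_relative(dir_fd: int, name: str) -> bytes:
    """Open `name` relative to an already-open directory and read it."""
    # With dir_fd set, a relative path is resolved against that directory
    # descriptor instead of the current working directory.
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
    try:
        return os.read(fd, 4096)
    finally:
        os.close(fd)


def demo() -> bytes:
    """Create a scratch directory, then read a file in it via a dir fd."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "hello.txt"), "wb") as f:
            f.write(b"opened via dir fd\n")
        dir_fd = os.open(tmp, os.O_RDONLY)
        try:
            return read_relative(dir_fd, "hello.txt")
        finally:
            os.close(dir_fd)


if __name__ == "__main__":
    # dir_fd support depends on the platform, so check before using it.
    if os.open in os.supports_dir_fd:
        print(demo())
```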

There was one very magisterial voice that objected to KBI breakage
on a stable branch in principle. In my opinion, the benefits of the bug
fixes and functionality improvements from the proposed merges are much
greater than the inconvenience of having to recompile out-of-tree fs
modules. The changes were discussed with re@ to some extent.

If there is vocal objection to the merge, I will abstain
from doing it.




Re: impossible packet length ...

2009-02-08 Thread Robert Watson

On Sun, 8 Feb 2009, Peter Jeremy wrote:

> On 2009-Feb-08 11:31:45 +0200, Danny Braniss  wrote:
>> Q: with rxcsum on, and a bad checksum packet is received, is it
>>   dropped by the NIC? if not, then it somewhat explains the behaviour
>
> If checksum offloading is working correctly then a bad packet should be 
> dropped by the NIC.  If checksum offloading isn't working correctly then you 
> can wind up in the situation where both the NIC and the driver think the 
> other party has verified the checksum.  It's also possible that you may be 
> running into corruption during DMA transfer from the NIC to RAM.  ISTR there 
> have been some issues reported recently with checksum offloading on some 
> NICs - though I don't have details to hand - you might like to search the 
> lists.
>
>> changing the nic is tough, but if needed will be done.
>
> If disabling checksum offloading fixes the problem and the additional CPU 
> load is acceptable (at least until you find a real fix) then there's no need 
> to change NICs.

Actually, my understanding was that packets with bad checksums are delivered 
to software, and flags in the descriptor ring header for each packet tell us 
whether the checksum was (a) checked and (b) validated by the hardware.  We 
then propagate these to mbuf flags so that higher stack layers know whether or 
not to calculate the checksum themselves.  Regardless of the specifics, 
though, packets with checked but bad checksums shouldn't make it to the socket 
layer where they would be visible to NFS.  If the NIC is marking apparently 
bad packets as good, there are a number of possible sources -- be it bad 
checksum handling in the card or corruption between the card and higher levels 
of the stack (a DMA problem, as you point out, would have this symptom).


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-02-08 Thread Pete French
> load.  Kip Macy has corrected at least one (both?) problems in head, and
> plans to MFC the fixes in the near future.  We'll follow up further once
> the fixes are merged, and if any further problems transpire.

Hi, just wondering if we are any closer to having the MFC for this yet, or
if there are any patches I could test ?

cheers,

-pete.


Re: impossible packet length ...

2009-02-08 Thread Danny Braniss
> On Sun, 8 Feb 2009, Peter Jeremy wrote:
> 
> > On 2009-Feb-08 11:31:45 +0200, Danny Braniss  wrote:
> >> Q: with rxcsum on, and a bad checksum packet is received, is it
> >>   dropped by the NIC? if not, then it somewhat explains the behaviour
> >
> > If checksum offloading is working correctly then a bad packet should be 
> > dropped by the NIC.  If checksum offloading isn't working correctly then 
> > you can wind up in the situation where both the NIC and the driver think 
> > the other party has verified the checksum.  It's also possible that you 
> > may be running into corruption during DMA transfer from the NIC to RAM.  
> > ISTR there have been some issues reported recently with checksum 
> > offloading on some NICs - though I don't have details to hand - you might 
> > like to search the lists.
> >
> >> changing the nic is tough, but if needed will be done.
> >
> > If disabling checksum offloading fixes the problem and the additional CPU 
> > load is acceptable (at least until you find a real fix) then there's no 
> > need to change NICs.
> 
> Actually, my understanding was that packets with bad checksums are delivered 
> to software, and flag the descriptor ring header for each packet tells us 
> whether the checksum was (a) checked and (b) validated by the hardware.  We 
> then propagate these to mbuf flags so that higher stack layers know whether 
> or not to calculate the checksum themselves.  Regardless of the specifics, 
> though, packets with checked but bad checksums shouldn't make it to the 
> socket layer where they would be visible to NFS.  If the NIC is marking 
> apparently bad packets as good, there are a number of possible sources -- be 
> it bad checksum handling in the card, corruption between the card and higher 
> levels of the stack (a DMA problem, as you point out, would have this 
> symptom).

Looking at the bce source, it's not clear (to me :-). If errors are detected in
bce_rx_intr(), the packet gets dropped, which is how I would expect an
offloaded checksum error to be treated, but that does not seem to be the case.

danny





Re: impossible packet length ...

2009-02-08 Thread Robert Watson


On Sun, 8 Feb 2009, Danny Braniss wrote:

> looking at the bce source, it's not clear (to me :-). If errors are detected 
> in bce_rx_intr(), the packet gets dropped, which I would expect to be the 
> treatment of an offloaded checksum error, but it seems that is not the case.


I think we're thinking of different checksums -- devices/device drivers drop 
frames with bad ethernet checksums, but not IP and above layer checksums.
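
To make the distinction concrete: the Ethernet FCS is a CRC-32 checked on every frame, while the IPv4 header checksum is a 16-bit one's-complement sum in the style of RFC 1071. A rough Python illustration of the latter follows. The function names and the sample header bytes are assumptions for this example, not code from any driver. It also shows why a receiver can verify a header by summing it with its checksum field included and comparing the result against 0xffff.

```python
def ones_complement_sum16(data: bytes) -> int:
    """16-bit one's-complement sum over data, RFC 1071 style."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return total


def internet_checksum(data: bytes) -> int:
    """Checksum to store: the complement of the one's-complement sum."""
    return ~ones_complement_sum16(data) & 0xFFFF


def fill_checksum(header: bytes) -> bytes:
    """Return a 20-byte IPv4 header with bytes 10-11 set to its checksum."""
    zeroed = header[:10] + b"\x00\x00" + header[12:]
    return zeroed[:10] + internet_checksum(zeroed).to_bytes(2, "big") + zeroed[12:]


def header_is_valid(header: bytes) -> bool:
    """A header carrying a correct checksum sums to 0xffff."""
    return ones_complement_sum16(header) == 0xFFFF


# Sample 20-byte IPv4 header with the checksum field zeroed (illustrative).
SAMPLE_HEADER = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")

if __name__ == "__main__":
    good = fill_checksum(SAMPLE_HEADER)
    print(hex(internet_checksum(SAMPLE_HEADER)), header_is_valid(good))
```

A single flipped bit in the header makes header_is_valid() return False, but the sum is far weaker than a CRC: reordered 16-bit words, for example, go undetected.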


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: impossible packet length ...

2009-02-08 Thread Peter Jeremy
On 2009-Feb-08 10:45:13 +0200, Danny Braniss  wrote:
>Feb  6 18:00:13 warhol-00.cs.huji.ac.il kernel: bce0: discard frame w/o 
>leading ethernet header (len 0 pkt len 0)
...
>Feb  6 19:00:00 warhol-00.cs.huji.ac.il amd[715]: Unknown $ sequence in 
>"rhost:=${RHOST};type:=nfsl;fs:=${FS};rfs:=$huldig#^ZM-^KoM- abase"
>Feb  6 19:00:00 warhol-00.cs.huji.ac.il kernel: impossible packet length 
>(2068989523) from nfs server sunfire:/dist
>
>which seems to point fingers at bce...

It does rather suggest that bce is not behaving.  What happens if you
turn off checksum off-loading?  This should make the kernel drop the
corrupt packets instead of trying to process them.  If practical, you
could also try (temporarily) plugging in a different NIC.

-- 
Peter Jeremy




Re: impossible packet length ...

2009-02-08 Thread Danny Braniss
> 
> On Sun, 8 Feb 2009, Danny Braniss wrote:
> 
> > looking at the bce source, it's not clear (to me :-). If errors are 
> > detected in bce_rx_intr(), the packet gets dropped, which I would expect 
> > to be the treatment of an offloaded checksum error, but it seems that is 
> > not the case.
> 
> I think we're thinking of different checksums -- devices/device drivers drop 
> frames with bad ethernet checksums, but not IP and above layer checksums.

I know I'm stepping on thin ice here - haven't touched Stevens for a while
(and I doubt it mentions offloading) - but if the offload checksum is bad,
why not just drop the packet?

The way I read the driver, if checksum offload is on and no
errors were detected, then the packet is marked as ok.

danny




Re: impossible packet length ...

2009-02-08 Thread Robert Watson


On Sun, 8 Feb 2009, Danny Braniss wrote:

>> On Sun, 8 Feb 2009, Danny Braniss wrote:
>>
>>> looking at the bce source, it's not clear (to me :-). If errors are 
>>> detected in bce_rx_intr(), the packet gets dropped, which I would expect 
>>> to be the treatment of an offloaded checksum error, but it seems that is 
>>> not the case.
>>
>> I think we're thinking of different checksums -- devices/device drivers 
>> drop frames with bad ethernet checksums, but not IP and above layer 
>> checksums.
>
> I know I'm stepping on thin ice here - haven't touched Stevens for a while, 
> (and I doubt it mentions offloading), but if the offload checksum is bad, 
> why not just drop the packet?
>
> The way I read the driver, if the offload checksum is on, and if no errors 
> were detected, then it's marked as ok.


There are a few good reasons I can think of, but this is hardly a 
comprehensive list:


(1) If there are bad higher level checksums on the wire, you want to see them
in tcpdump, so allow them to get up to a higher layer if network layer
checksums aren't good.

(2) It's a matter of local policy as to whether UDP checksums (for example)
are observed or not.

(3) If you're forwarding or bridging packets, it should be up to the end nodes
how they deal with bad UDP checksums on packets to them, not the routers.

Looking at if_bce.c, the following seems to be reasonable logic; first, 
ethernet-layer checksums:


    /* Check the received frame for errors. */
    if (status & (L2_FHDR_ERRORS_BAD_CRC |
        L2_FHDR_ERRORS_PHY_DECODE | L2_FHDR_ERRORS_ALIGNMENT |
        L2_FHDR_ERRORS_TOO_SHORT  | L2_FHDR_ERRORS_GIANT_FRAME)) {

        /* Log the error and release the mbuf. */
        ifp->if_ierrors++;
        DBRUN(sc->l2fhdr_status_errors++);

        m_freem(m0);
        m0 = NULL;
        goto bce_rx_int_next_rx;
    }

I.e., if there are ethernet-level CRC failures, drop the packet.

    /* Validate the checksum if offload enabled. */
    if (ifp->if_capenable & IFCAP_RXCSUM) {

        /* Check for an IP datagram. */
        if (!(status & L2_FHDR_STATUS_SPLIT) &&
            (status & L2_FHDR_STATUS_IP_DATAGRAM)) {
            m0->m_pkthdr.csum_flags |= CSUM_IP_CHECKED;

            /* Check if the IP checksum is valid. */
            if ((l2fhdr->l2_fhdr_ip_xsum ^ 0xffff) == 0)
                m0->m_pkthdr.csum_flags |= CSUM_IP_VALID;
        }

        /* Check for a valid TCP/UDP frame. */
        if (status & (L2_FHDR_STATUS_TCP_SEGMENT |
            L2_FHDR_STATUS_UDP_DATAGRAM)) {

            /* Check for a good TCP/UDP checksum. */
            if ((status & (L2_FHDR_ERRORS_TCP_XSUM |
                L2_FHDR_ERRORS_UDP_XSUM)) == 0) {
                m0->m_pkthdr.csum_data =
                    l2fhdr->l2_fhdr_tcp_udp_xsum;
                m0->m_pkthdr.csum_flags |= (CSUM_DATA_VALID
                    | CSUM_PSEUDO_HDR);
            }
        }
    }

Only look at higher level checksums if policy enables it on the interface; 
then, only if the hardware has a view on the IP-layer checksums, propagate 
that information from the descriptor ring entry flags to the mbuf flags: both 
whether or not the checksum was verified, and whether or not it was good.  If 
policy disables it, or the hardware expresses no view, we don't set flags, 
which simply defers checksumming to a higher layer (if required -- for 
forwarded packets, we won't test UDP-layer checksums at all).
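
The flag propagation described above can be modelled in a few lines of Python. This is only a sketch of the decision logic as I read it from the quoted driver excerpt: the bit values below are placeholders picked for the example (the real L2_FHDR_* and CSUM_* constants live in the bce(4) and mbuf headers), the function name is made up, and it assumes the IP test compares the hardware-reported sum against 0xffff. The point it illustrates is that a bad upper-layer checksum never causes a drop here; the packet simply goes up without the "valid" flags.

```python
# Placeholder bit positions, chosen only for this sketch.
L2_FHDR_STATUS_IP_DATAGRAM = 1 << 0
L2_FHDR_STATUS_TCP_SEGMENT = 1 << 1
L2_FHDR_STATUS_UDP_DATAGRAM = 1 << 2
L2_FHDR_STATUS_SPLIT = 1 << 3
L2_FHDR_ERRORS_TCP_XSUM = 1 << 4
L2_FHDR_ERRORS_UDP_XSUM = 1 << 5

CSUM_IP_CHECKED = 1 << 8
CSUM_IP_VALID = 1 << 9
CSUM_DATA_VALID = 1 << 10
CSUM_PSEUDO_HDR = 1 << 11


def rx_csum_flags(rxcsum_enabled: bool, status: int, ip_xsum: int) -> int:
    """Return the mbuf checksum flags this model would set for one frame."""
    flags = 0
    if not rxcsum_enabled:
        # Policy disables offload: set nothing, the stack checks in software.
        return flags

    # IP header checksum: record that it was checked, and whether it was good.
    if not (status & L2_FHDR_STATUS_SPLIT) and (status & L2_FHDR_STATUS_IP_DATAGRAM):
        flags |= CSUM_IP_CHECKED
        if (ip_xsum ^ 0xFFFF) == 0:
            flags |= CSUM_IP_VALID

    # TCP/UDP checksum: only mark the data valid when no error bit is set.
    if status & (L2_FHDR_STATUS_TCP_SEGMENT | L2_FHDR_STATUS_UDP_DATAGRAM):
        if (status & (L2_FHDR_ERRORS_TCP_XSUM | L2_FHDR_ERRORS_UDP_XSUM)) == 0:
            flags |= CSUM_DATA_VALID | CSUM_PSEUDO_HDR

    return flags


if __name__ == "__main__":
    udp_ok = L2_FHDR_STATUS_IP_DATAGRAM | L2_FHDR_STATUS_UDP_DATAGRAM
    print(hex(rx_csum_flags(True, udp_ok, 0xFFFF)))
    print(hex(rx_csum_flags(True, udp_ok | L2_FHDR_ERRORS_UDP_XSUM, 0xFFFF)))
```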


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Big problems with 7.1 locking up :-(

2009-02-08 Thread Stefan Lambrev

Hi all,

In this thread someone mentioned a problem with soekris devices.
I personally have one of those new soekris devices with 7.1R installed,
and it is very easy to freeze it.
All I have to do is copy a big file over WIFI (atheros) at a
speed higher than 1-2MB/s.
It takes less than 2 minutes to freeze. I wonder if there is some
improvement in 7.1-stable that I could try, or if I can help by compiling
a debug kernel?
But I'm not sure this is the same problem, as it may be just the
wireless driver in my case.


On Feb 8, 2009, at 3:11 PM, Pete French wrote:

>> load.  Kip Macy has corrected at least one (both?) problems in head, and
>> plans to MFC the fixes in the near future.  We'll follow up further once
>> the fixes are merged, and if any further problems transpire.
>
> Hi, just wondering if we are any closer to having the MFC for this yet, or
> if there are any patches I could test ?
>
> cheers,
>
> -pete.


--
Best Wishes,
Stefan Lambrev
ICQ# 24134177







Re: Big problems with 7.1 locking up :-(

2009-02-08 Thread cpghost
On Sun, Feb 08, 2009 at 05:11:02PM +0200, Stefan Lambrev wrote:
> Hi all,
> 
> In this thread someone mentioned a problem with soekris devices.
> I personally have one of those new soekris devices and installed 7.1R
> and it is very easy to freeze it.
> All that I have to do is to copy a big file over WIFI (atheros) with
> speed higher than 1-2MB/s.
> It takes less than 2 minutes to freeze. I wonder if there is some
> improvement in 7.1-stable so I can try it or if I can help by compiling
> a debug kernel?
> But I'm not sure if this is the same problem as it may be just the
> wireless driver in my case.

On some net4801s without WIFI, I also experience frequent
freezes after anywhere from a couple of hours to 2-5 days... so it's
probably not only ath related.

What's your kern.hz value? In my /boot/loader.conf, it is set
to 100. Could you try it too, and see if you can still freeze
the box (just to rule out some weird timing / interrupt issue)?

> Best Wishes,
> Stefan Lambrev
> ICQ# 24134177

Regards,
-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/


Re: Big problems with 7.1 locking up :-(

2009-02-08 Thread Mike Tancsa

At 10:11 AM 2/8/2009, Stefan Lambrev wrote:
> Hi all,
>
> In this thread someone mentioned a problem with soekris devices.
> I personally have one of those new soekris devices and installed 7.1R
> and it is very easy to freeze it.
> All that I have to do is to copy a big file over WIFI (atheros) with
> speed higher than 1-2MB/s.



Try and copy across the ethernet.   I have several RELENG_7 boxes 
deployed on soekris and Alix boards (same chipset pretty well) and 
have not seen any stability issues.



---Mike 




Re: Broken loader on 7.1-STABLE?

2009-02-08 Thread Mark Kirkwood

Mark Kirkwood wrote:

> ...specifying /boot/loader.old got me booted ok (not sure why this 
> *didn't* work with the Asus, maybe I need to try it again with the Feb 
> sources).





I tried the latest RELENG_7 sources, same result - does *not* boot even 
specifying the old loader. I spent a bit of time narrowing down why. I'd 
previously noted that an empty loader.conf was sufficient to get it to 
boot again. After some experimentation I discovered that this line in 
loader.conf:


sound_load="YES"

made the boot with the old loader fail (loading the sound module after 
booting seems to work ok).


The box is an Asus a8vx with amd64 x2 3800+, running i386.

regards

Mark


Re: Unhappy Xorg upgrade

2009-02-08 Thread Bruce M. Simpson

Bruce M. Simpson wrote:
> S.N.Grigoriev wrote:
>> I thank you for your response. I've applied the patch to pci.c from
>> kern/130957. Unfortunately there are no positive results. USB is still
>> unreachable with X.
>
> Just following up to confirm that you are seeing exactly the same 
> symptoms with USB and Xorg 7.4 as I see on my amd64 desktop running 
> 7-STABLE from 00:00 UTC on this Wednesday.


I still see the USB symptoms with xorg-server port as of today -- forced 
rebuild with libpciaccess also. So amd64 is still regressed -- USB is 
totally unusable there after X is started. My theory was that somehow 
Xorg was stomping on the USB controller registers on this machine. The 
USB controller on this box is ALi, card=0x81561043.


My i386 laptop (IBM/Lenovo T43) is not affected, and USB mice work just 
fine there.


Obviously it's difficult to check what Xorg is actually doing to the 
registers on the box w/o a PCI bus analyzer, and of course due to normal 
decoding, those cycles probably won't be seen on the backplane itself as 
it sits behind a bridge; I haven't fully read what libpciaccess is doing.


I skimmed patch-src-freebsd_pci.c. I wonder if this code may be stomping 
on the USB controller in some way (i.e. how it frobs the BARs).


According to src/tools/tools/pciroms, the only PCI devices on this box 
with ROM BARs are mskc0 and vgapci0.


(I also wonder if it's possible to guarantee that the window at 0xC0000 
is always going to be available, even in the amd64 case.)


cheers
BMS


Re: impossible packet length ...

2009-02-08 Thread Eric Anderson

On Feb 8, 2009, at 3:31 AM, Danny Braniss wrote:

>> On 2009-Feb-08 10:45:13 +0200, Danny Braniss wrote:
>>> Feb  6 18:00:13 warhol-00.cs.huji.ac.il kernel: bce0: discard frame w/o
>>> leading ethernet header (len 0 pkt len 0)
>> ...
>>> Feb  6 19:00:00 warhol-00.cs.huji.ac.il amd[715]: Unknown $ sequence in
>>> "rhost:=${RHOST};type:=nfsl;fs:=${FS};rfs:=$huldig#^ZM-^KoM- abase"
>>> Feb  6 19:00:00 warhol-00.cs.huji.ac.il kernel: impossible packet length
>>> (2068989523) from nfs server sunfire:/dist
>>>
>>> which seems to point fingers at bce...
>>
>> It does rather suggest that bce is not behaving.  What happens if you
>> turn off checksum off-loading?  This should make the kernel drop the
>> corrupt packets instead of trying to process them.  If practical, you
>> could also try (temporarily) plugging in a different NIC.
>
> I have, and now it's a matter of waiting...
> Q: with rxcsum on, and a bad checksum packet is received, is it
>   dropped by the NIC? if not, then it somewhat explains the behaviour
>
> changing the nic is tough, but if needed will be done.
> danny
>
>> Peter Jeremy



We were hitting this quite a bit (also bce), and updated to a recent
7-branch and it seems to be behaving better for now.  Running 12 days so
far (which is better than what we had been seeing).


Eric








sysctl lock in RELENG_6

2009-02-08 Thread Eugene Grosbein
Hi!

I have a RELENG_6 system controlling a local PBX through an RS-232 port, sio(4).
It also runs syslogd, cron, sshd, bsnmpd and sendmail for outgoing reports.

It locks up very often: it answers pings, but the PBX controlling software stops
responding, and local and remote login attempts hang because the 'login' process
is stuck in the 'sysctl lock' state. Local consoles do switch with 'Alt-Fn'
and DDB works. It shows that sendmail is in the 'sysctl lock' state too.

This is a NanoBSD installation running from IDE flash. It's swapless,
but I think I could manage to obtain a crashdump if there is interest in it.

I dug through the commit logs a bit and found this change, MFC'd to RELENG_7
but not RELENG_6:

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/kern_sysctl.c#rev1.177.6.2

It seems RELENG_6 needs this too, doesn't it?
I'm going to merge the change to RELENG_6 and give it a try.

Eugene Grosbein


Re: Unhappy Xorg upgrade

2009-02-08 Thread Robert Noland
On Mon, 2009-02-09 at 02:08 +0000, Bruce M. Simpson wrote:
> Bruce M. Simpson wrote:
> > S.N.Grigoriev wrote:
> >> I thank you for your response. I've applied the patch to pci.c from
> >> kern/130957. Unfortunately there are no positive results. USB is still
> >> unreachable with X.
> >
> > Just following up to confirm that you are seeing exactly the same 
> > symptoms with USB and Xorg 7.4 as I see on my amd64 desktop running 
> > 7-STABLE from 00:00 UTC on this Wednesday.
> 
> I still see the USB symptoms with xorg-server port as of today -- forced 
> rebuild with libpciaccess also. So amd64 is still regressed -- USB is 
> totally unusable there after X is started. My theory was that somehow 
> Xorg was stomping on the USB controller registers on this machine. The 
> USB controller on this box is ALi, card=0x81561043.
> 
> My i386 laptop (IBM/Lenovo T43) is not affected, and USB mice work just 
> fine there.
> 
> Obviously it's difficult to check what Xorg is actually doing to the 
> registers on the box w/o a PCI bus analyzer, and of course due to normal 
> decoding, those cycles probably won't be seen on the backplane itself as 
> it sits behind a bridge; I haven't fully read what libpciaccess is doing.
> 
> I skimmed patch-src-freebsd_pci.c. I wonder if this code may be stomping 
> on the USB controller in some way (i.e. how it frobs the BARs).

Until last night, it only probed pci resources for pci class DISPLAY
subclass VGA.  The rom reading was restricted to 0xc0000/0x10000, which
it mmapped and copied out to a userland buffer.

As of last night, I committed the code that actually checks for a pci
rom.  If it finds one, it uses those values (base address, length) to
mmap the bios for copy.  If it doesn't find a pci rom (most IGDs:
intel, via, sis), it just uses the 0xc0000 mapping as it did before if
it is i386 or amd64.  Otherwise, bios reading just fails.
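
The fallback order described above, written out as a small Python sketch. It is only a model of the decision, not the libpciaccess code: the function name is invented, and the legacy window constants are assumed here to be the conventional x86 VGA BIOS location (base 0xC0000, length 0x10000).

```python
from typing import Optional, Tuple

# Assumed values: the conventional legacy VGA BIOS window on x86.
LEGACY_ROM_BASE = 0xC0000
LEGACY_ROM_SIZE = 0x10000

RomWindow = Tuple[int, int]  # (base address, length)


def pick_rom_window(pci_rom: Optional[RomWindow], arch: str) -> Optional[RomWindow]:
    """Choose which region to mmap when reading the video BIOS.

    pci_rom is the (base, length) reported for a PCI ROM, or None when
    the device has none. Returns None when BIOS reading should fail.
    """
    if pci_rom is not None:
        # A real PCI ROM was found: use its base address and length.
        return pci_rom
    if arch in ("i386", "amd64"):
        # No PCI ROM (typical for integrated graphics): legacy window.
        return (LEGACY_ROM_BASE, LEGACY_ROM_SIZE)
    # Other architectures have no legacy window to fall back on.
    return None


if __name__ == "__main__":
    print(pick_rom_window((0xFE000000, 0x20000), "amd64"))
    print(pick_rom_window(None, "i386"))
    print(pick_rom_window(None, "sparc64"))
```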

robert.

> According to src/tools/tools/pciroms, the only PCI devices on this box 
> with ROM BARs are mskc0 and vgapci0.
> 
> (I also wonder if it's possible to guarantee that the window at 0xC0000 
> is always going to be available, even in the amd64 case.)
> 
> cheers
> BMS
-- 
Robert Noland 
FreeBSD




Re: Unhappy Xorg upgrade

2009-02-08 Thread Robert Noland
On Mon, 2009-02-09 at 02:08 +0000, Bruce M. Simpson wrote:
> Bruce M. Simpson wrote:
> > S.N.Grigoriev wrote:
> >> I thank you for your response. I've applied the patch to pci.c from
> >> kern/130957. Unfortunately there are no positive results. USB is still
> >> unreachable with X.
> >
> > Just following up to confirm that you are seeing exactly the same 
> > symptoms with USB and Xorg 7.4 as I see on my amd64 desktop running 
> > 7-STABLE from 00:00 UTC on this Wednesday.
> 
> I still see the USB symptoms with xorg-server port as of today -- forced 
> rebuild with libpciaccess also. So amd64 is still regressed -- USB is 
> totally unusable there after X is started. My theory was that somehow 
> Xorg was stomping on the USB controller registers on this machine. The 
> USB controller on this box is ALi, card=0x81561043.

Is your usb sharing interrupts with the video card?

Does the issue occur if you aren't using a usb mouse?

robert.

> My i386 laptop (IBM/Lenovo T43) is not affected, and USB mice work just 
> fine there.
> 
> Obviously it's difficult to check what Xorg is actually doing to the 
> registers on the box w/o a PCI bus analyzer, and of course due to normal 
> decoding, those cycles probably won't be seen on the backplane itself as 
> it sits behind a bridge; I haven't fully read what libpciaccess is doing.
> 
> I skimmed patch-src-freebsd_pci.c. I wonder if this code may be stomping 
> on the USB controller in some way (i.e. how it frobs the BARs).
> 
> According to src/tools/tools/pciroms, the only PCI devices on this box 
> with ROM BARs are mskc0 and vgapci0.
> 
> (I also wonder if it's possible to guarantee that the window at 0xC 
> is always going to be available, even in the amd64 case.)
> 
> cheers
> BMS
-- 
Robert Noland 
FreeBSD




7.1 Panic on degraded disk w/mpt

2009-02-08 Thread Charles Sprickman

Howdy,

I dug around and can't find a PR on this, and the only other report I saw 
was in this mailing list post that has no replies:


http://www.nabble.com/7.1-BETA2-panic-on-mpt-degrade-td20183173.html

The hardware is a Dell PowerEdge 860 with the Dell/LSI SAS5 controller:
mpt0:  port 0xec00-0xecff mem 0xfe9fc000-0xfe9f,0xfe9e-0xfe9e irq 16 at device 8.0 on pci2
mpt0: MPI Version=1.5.13.0

The panic is repeatable by forcing the array into a degraded state.

Here's my best shot at getting info out of kgdb:

[r...@uniweb /home/spork]# cd /usr/obj/usr/src/sys/BWAY7/
[r...@uniweb /usr/obj/usr/src/sys/BWAY7]# kgdb kernel.debug /var/crash/vmcore.0
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x14
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc044b09b
stack pointer           = 0x28:0xe6ee5b80
frame pointer           = 0x28:0xe6ee5b9c
code segment            = base 0x0, limit 0xf, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 17 (swi2: cambio)
trap number             = 12
panic: page fault
cpuid = 0
Uptime: 3m7s
Physical memory: 3575 MB
Dumping 94 MB: 79 63 47 31 15

Reading symbols from /boot/kernel/acpi.ko...Reading symbols from /boot/kernel/acpi.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/acpi.ko
#0  doadump () at pcpu.h:196
196 __asm __volatile("movl %%fs:0,%0" : "=r" (td));
(kgdb) list *0xc044b09b
0xc044b09b is in xpt_done (/usr/src/sys/cam/cam_xpt.c:4832).
4827            if ((done_ccb->ccb_h.func_code & XPT_FC_QUEUED) != 0) {
4828                    /*
4829                     * Queue up the request for handling by our SWI handler
4830                     * any of the "non-immediate" type of ccbs.
4831                     */
4832                    sim = done_ccb->ccb_h.path->bus->sim;
4833                    switch (done_ccb->ccb_h.path->periph->type) {
4834                    case CAM_PERIPH_BIO:
4835                            TAILQ_INSERT_TAIL(&sim->sim_doneq, &done_ccb->ccb_h,
4836                                              sim_links.tqe);

(kgdb) backtrace
#0  doadump () at pcpu.h:196
#1  0xc061d0f7 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:418
#2  0xc061d3c9 in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:574
#3  0xc0865fcc in trap_fatal (frame=0xe6ee5b40, eva=20)
    at /usr/src/sys/i386/i386/trap.c:939
#4  0xc0866230 in trap_pfault (frame=0xe6ee5b40, usermode=0, eva=20)
    at /usr/src/sys/i386/i386/trap.c:852
#5  0xc0866bc2 in trap (frame=0xe6ee5b40) at /usr/src/sys/i386/i386/trap.c:530
#6  0xc084d45b in calltrap () at /usr/src/sys/i386/i386/exception.s:159
#7  0xc044b09b in xpt_done (done_ccb=0xc6bf5000)
    at /usr/src/sys/cam/cam_xpt.c:4832
#8  0xc044eee9 in xpt_scan_bus (periph=0xc6984b00, request_ccb=0xc6bf5000)
    at /usr/src/sys/cam/cam_xpt.c:5395
#9  0xc044d241 in camisr_runqueue (V_queue=Variable "V_queue" is not available.
) at /usr/src/sys/cam/cam_xpt.c:7316
#10 0xc044d39e in camisr (dummy=0x0) at /usr/src/sys/cam/cam_xpt.c:7216
#11 0xc05fb41b in ithread_loop (arg=0xc699d770)
    at /usr/src/sys/kern/kern_intr.c:1088
#12 0xc05f7f69 in fork_exit (callout=0xc05fb260 ,
    arg=0xc699d770, frame=0xe6ee5d38) at /usr/src/sys/kern/kern_fork.c:810
#13 0xc084d4d0 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:264
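
A note on reading the trap above: a fault virtual address as small as
0x14 (eva=20 in frames #3 and #4) usually suggests that a member was
loaded through a NULL base pointer, since the faulting address is then
just the member's offset.  The sketch below only illustrates that
arithmetic.  The struct layout and function names are hypothetical; the
real CAM structure layouts are not shown in this post, so it cannot say
which pointer in the path->bus->sim chain was actually NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout in which 'sim' sits 0x14 bytes into the structure. */
struct demo_bus {
	uint32_t pad[5]; /* 5 * 4 = 20 bytes of preceding fields */
	uint32_t sim;    /* stand-in for a 32-bit pointer member */
};

/* Address touched when loading a member at 'member_offset' from 'base'. */
uintptr_t
fault_address(uintptr_t base, size_t member_offset)
{
	return (base + member_offset);
}

/* Heuristic: a fault inside the first page points at a NULL base pointer. */
int
looks_like_null_deref(uintptr_t eva, size_t page_size)
{
	assert(page_size > 0);
	return (eva < page_size);
}
```

With a NULL base, fault_address(0, offsetof(struct demo_bus, sim)) gives
0x14, which is the shape of the address reported in the trap.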


I can supply dmesg and more info, make it crash again, etc.  I suspect it will 
panic again when the rebuild completes; I'll capture that one as well.


Please let me know how to proceed - I can open a PR if this is truly a 
bug, or bring it over to freebsd-scsi if more appropriate.


Thanks,

Charles

___
Charles Sprickman
NetEng/SysAdmin
Bway.net - New York's Best Internet - www.bway.net
sp...@bway.net - 212.655.9344

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: impossible packet length ...

2009-02-08 Thread Danny Braniss
> 
> On Sun, 8 Feb 2009, Danny Braniss wrote:
> 
> >> On Sun, 8 Feb 2009, Danny Braniss wrote:
> >>
> >>> looking at the bce source, it's not clear (to me :-). If errors are 
> >>> detected in bce_rx_intr(), the packet gets dropped, which I would expect 
> >>> to be the treatment of an offloaded checksum error, but it seems that is 
> >>> not the case.
> >>
> >> I think we're thinking of different checksums -- devices/device drivers 
> >> drop frames with bad ethernet checksums, but not IP and above layer 
> >> checksums.
> >
> > I know I'm stepping on thin ice here - haven't touched Stevens for a while, 
> > (and I doubt it mentions offloading), but if the offload checksum is bad, 
> > why not just drop the packet?
> >
> > The way I read the driver, if the offload checksum is on, and if no errors 
> > were detected, then it's marked as ok.
> 
> There are a few good reasons I can think of, but this is hardly a 
> comprehensive list:
> 
> (1) If there are bad higher level checksums on the wire, you want to see them
>  in tcpdump, so allow them to get up to a higher layer if network layer
>  checksums aren't good.
> 
> (2) It's a matter of local policy as to whether UDP checksums (for example)
>  are observed or not.
> 
> (3) If you're forwarding or bridging packets, it should be up to the end nodes
>  how they deal with bad UDP checksums on packets to them, not the routers.

ok, I can understand the logic.

> 
> Looking at if_bce.c, the following seems to be reasonable logic; first, 
> ethernet-layer checksums:
> 
> 5902         /* Check the received frame for errors. */
> 5903         if (status & (L2_FHDR_ERRORS_BAD_CRC |
> 5904             L2_FHDR_ERRORS_PHY_DECODE | L2_FHDR_ERRORS_ALIGNMENT |
> 5905             L2_FHDR_ERRORS_TOO_SHORT  | L2_FHDR_ERRORS_GIANT_FRAME)) {
> 5906
> 5907                 /* Log the error and release the mbuf. */
> 5908                 ifp->if_ierrors++;
> 5909                 DBRUN(sc->l2fhdr_status_errors++);
> 5910
> 5911                 m_freem(m0);
> 5912                 m0 = NULL;
> 5913                 goto bce_rx_int_next_rx;
> 5914         }
> 
> I.e., if there are ethernet-level CRC failures, drop the packet.
> 
> 5922         /* Validate the checksum if offload enabled. */
> 5923         if (ifp->if_capenable & IFCAP_RXCSUM) {
> 5924
> 5925                 /* Check for an IP datagram. */
> 5926                 if (!(status & L2_FHDR_STATUS_SPLIT) &&
> 5927                     (status & L2_FHDR_STATUS_IP_DATAGRAM)) {
> 5928                         m0->m_pkthdr.csum_flags |= CSUM_IP_CHECKED;
> 5929
> 5930                         /* Check if the IP checksum is valid. */
> 5931                         if ((l2fhdr->l2_fhdr_ip_xsum ^ 0x) == 0)
> 5932                                 m0->m_pkthdr.csum_flags |= CSUM_IP_VALID;
> 5933                 }
> 5934
> 5935                 /* Check for a valid TCP/UDP frame. */
> 5936                 if (status & (L2_FHDR_STATUS_TCP_SEGMENT |
> 5937                     L2_FHDR_STATUS_UDP_DATAGRAM)) {
> 5938
> 5939                         /* Check for a good TCP/UDP checksum. */
> 5940                         if ((status & (L2_FHDR_ERRORS_TCP_XSUM |
> 5941                             L2_FHDR_ERRORS_UDP_XSUM)) == 0) {
> 5942                                 m0->m_pkthdr.csum_data =
> 5943                                     l2fhdr->l2_fhdr_tcp_udp_xsum;
> 5944                                 m0->m_pkthdr.csum_flags |= (CSUM_DATA_VALID
> 5945                                     | CSUM_PSEUDO_HDR);
> 5946                         }
> 5947                 }
> 5948         }
> 
> Only look at higher level checksums if policy enables it on the interface; 
> then, only if the hardware has a view on the IP-layer checksums, propagate 
> that information to the mbuf flags from the descriptor ring entry flags, both 
> whether or not the checksum was verified, and whether or not it was good.  If 
> policy disables it, or the hardware expresses no view, we don't set flags, 
> which simply defers checksumming to a higher layer (if required -- for 
> forwarded packets, we won't test UDP-layer checksums at all).
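
The flag propagation described above can be modelled in a few lines of
standalone C.  This is a simplified sketch, not the driver code: every
DEMO_ constant and both function names are made up for illustration
(the values are not the real if_bce.c or mbuf.h values), the SPLIT test
from line 5926 is left out, and the IP checksum result is passed in as
a plain boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for descriptor status bits (values made up). */
#define DEMO_STATUS_IP_DATAGRAM  0x01u
#define DEMO_STATUS_TCP_SEGMENT  0x02u
#define DEMO_STATUS_UDP_DATAGRAM 0x04u
#define DEMO_ERRORS_TCP_XSUM     0x08u
#define DEMO_ERRORS_UDP_XSUM     0x10u

/* Illustrative stand-ins for mbuf checksum flags (values made up). */
#define DEMO_CSUM_IP_CHECKED 0x1u
#define DEMO_CSUM_IP_VALID   0x2u
#define DEMO_CSUM_DATA_VALID 0x4u
#define DEMO_CSUM_PSEUDO_HDR 0x8u

/* Translate what the hardware reported into flags for the stack. */
uint32_t
demo_rx_csum_flags(bool rxcsum_enabled, uint32_t status, bool ip_xsum_ok)
{
	uint32_t flags = 0;

	if (!rxcsum_enabled)
		return (flags);	/* policy off: the stack checks everything */

	if (status & DEMO_STATUS_IP_DATAGRAM) {
		flags |= DEMO_CSUM_IP_CHECKED;
		if (ip_xsum_ok)
			flags |= DEMO_CSUM_IP_VALID;
	}
	if (status & (DEMO_STATUS_TCP_SEGMENT | DEMO_STATUS_UDP_DATAGRAM)) {
		/* "Good" here means that no error bit is set. */
		if ((status & (DEMO_ERRORS_TCP_XSUM | DEMO_ERRORS_UDP_XSUM)) == 0)
			flags |= DEMO_CSUM_DATA_VALID | DEMO_CSUM_PSEUDO_HDR;
	}
	return (flags);
}

/* Without DATA_VALID the upper layer has to verify the checksum itself. */
bool
demo_stack_must_verify_l4(uint32_t flags)
{
	return ((flags & DEMO_CSUM_DATA_VALID) == 0);
}
```

Note that a frame with a TCP/UDP checksum error bit set is not dropped
in this model; it simply arrives without the DATA_VALID flag, so the
decision is left to the upper layer.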

I missed line 5928, and as usual, your explanation is most educational!
The comment in line 5939 is a bit misleading: the way I read the code, it
does not check for a good checksum.
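
For reference on the IP check at line 5931: in ones-complement
arithmetic, summing a valid IPv4 header including its checksum field
gives 0xffff, so XOR-ing that sum with 0xffff yields zero.  The sketch
below shows this in software.  It assumes that the constant truncated
in the quoted listing is the 16-bit all-ones mask and that the hardware
field holds such a sum; neither is confirmed by the archive.  The
function names and the sample header are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sample 20-byte IPv4 header with a correct checksum field (0xb861). */
const uint8_t demo_ipv4_header[20] = {
	0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00,
	0x40, 0x11, 0xb8, 0x61, 0xc0, 0xa8, 0x00, 0x01,
	0xc0, 0xa8, 0x00, 0xc7
};

/* 16-bit ones-complement sum over big-endian words, carries folded in. */
uint16_t
ones_complement_sum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len % 2 == 0);
	for (i = 0; i < len; i += 2)
		sum += (uint32_t)((buf[i] << 8) | buf[i + 1]);
	while (sum >> 16)
		sum = (sum & 0xffffu) + (sum >> 16);
	return ((uint16_t)sum);
}

/* A header is valid when the sum, checksum field included, is all ones. */
int
ip_header_checksum_ok(const uint8_t *hdr, size_t len)
{
	return ((ones_complement_sum(hdr, len) ^ 0xffffu) == 0);
}
```

Flipping any single bit of the sample header, for instance in the TTL
byte, makes ip_header_checksum_ok() return 0.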

Cheers,
danny


