RE: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware bug question

2012-11-26 Thread Justin Piszcz


> [0.696357] ioatdma: Intel(R) QuickData Technology Driver 4.00
> [0.696487] ioatdma :00:04.0: channel error register unreachable
> I assume this is something Supermicro has to fix?

You are probably missing some kernel config option(s) :) - I did fight
similar issues on a Fujitsu SandyBridge Xeon based server.

Check if enabling CONFIG_X86_X2APIC helps as well as other APIC/IOMMU
options.

Bruno

=> Enabled:
CONFIG_IOMMU_SUPPORT
CONFIG_INTEL_IOMMU
CONFIG_INTEL_IOMMU_DEFAULT_ON
CONFIG_IRQ_REMAP

Also tried enabling NUMA, etc:

[0.330998] ACPI FADT declares the system doesn't support PCIe ASPM, so
disable it
[0.331068] ACPI: bus type pci registered

[0.615234] ACPI: Dynamic OEM Table Load:
[0.615373] ACPI: PRAD   (null) 000BE (v02 PRADID  PRADTID
0001 MSFT 0400)
[0.615631] \_SB_:_OSC invalid UUID
[0.615633] _OSC request data:1 7


[0.663138] pci :ff:13.5: [8086:3c44] type 00 class 0x110100
[0.663170] pci :ff:13.6: [8086:3c45] type 00 class 0x088000
[0.663211]  pci:ff: ACPI _OSC support notification failed, disabling
PCIe ASPM
[0.663281]  pci:ff: Unable to request _OSC control (_OSC support
mask: 0x08)

:(

Justin.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware bug question

2012-11-26 Thread Justin Piszcz


-Original Message-
From: Bjorn Helgaas [mailto:bhelg...@google.com] 
Sent: Monday, November 26, 2012 8:00 PM
To: Bruno Prémont
Cc: Justin Piszcz; supp...@supermicro.com; linux-kernel@vger.kernel.org; Dan
Williams
Subject: Re: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware
bug question

[Try Dan's current email address; sorry Dan]

On Mon, Nov 26, 2012 at 5:56 PM, Bjorn Helgaas  wrote:
> [+cc Dan]
>
> On Mon, Nov 26, 2012 at 2:42 PM, Bruno Prémont
>  wrote:
>> Hi Justin,
>>
>> On Sat, 24 November 2012 "Justin Piszcz" wrote:
>>> Is the following normal on an X9SRL-F board (bios 1.0a)?
>>>
>>> In the manual it states:
>>>
>>> Data Direct I/O
>>> Select Enabled to enable Intel I/OAT (I/O Acceleration Technology), which
>>> significantly reduces CPU overhead by leveraging CPU architectural
>>> improvements and freeing the system resource for other tasks. The options
>>> are Disabled and Enabled.
>>>
>>> Default is Enabled.
>>>
>>> When enabled in the kernel, I see the following:
>>>
>>> [0.696357] ioatdma: Intel(R) QuickData Technology Driver 4.00
>>> [0.696487] ioatdma :00:04.0: channel error register unreachable
>>> [0.696546] ioatdma :00:04.0: channel enumeration error
>>> [0.696604] ioatdma :00:04.0: Intel(R) I/OAT DMA Engine init failed
>>> [0.696721] ioatdma :00:04.1: channel error register unreachable
>>> [0.696779] ioatdma :00:04.1: channel enumeration error
>>> [0.697522] ioatdma :00:04.1: Intel(R) I/OAT DMA Engine init failed
>>> [0.697617] ioatdma :00:04.2: channel error register unreachable
>>> [0.697681] ioatdma :00:04.2: channel enumeration error
>>> [0.697739] ioatdma :00:04.2: Intel(R) I/OAT DMA Engine init failed
>>> [0.697831] ioatdma :00:04.3: channel error register unreachable
>>> [0.697890] ioatdma :00:04.3: channel enumeration error
>>> [0.697948] ioatdma :00:04.3: Intel(R) I/OAT DMA Engine init failed
>>> [0.698037] ioatdma :00:04.4: channel error register unreachable
>>> [0.698095] ioatdma :00:04.4: channel enumeration error
>>> [0.698153] ioatdma :00:04.4: Intel(R) I/OAT DMA Engine init failed
>>> [0.698245] ioatdma :00:04.5: channel error register unreachable
>>> [0.698303] ioatdma :00:04.5: channel enumeration error
>>> [0.698360] ioatdma :00:04.5: Intel(R) I/OAT DMA Engine init failed
>>> [0.698449] ioatdma :00:04.6: channel error register unreachable
>>> [0.698508] ioatdma :00:04.6: channel enumeration error
>>> [0.698565] ioatdma :00:04.6: Intel(R) I/OAT DMA Engine init failed
>>> [0.698676] ioatdma :00:04.7: channel error register unreachable
>>> [0.698735] ioatdma :00:04.7: channel enumeration error
>>> [0.698792] ioatdma :00:04.7: Intel(R) I/OAT DMA Engine init failed
>>>
>>> --
>>>
>>> Also, I tried using ASPM (enabled in BIOS), but since ACPI Linux query is
>>> ignored, it fails to work:
>>> [0.562229] [Firmware Bug]: ACPI: BIOS _OSI(Linux) query ignored
>>>
>>> I assume this is something Supermicro has to fix?
>>
>> You are probably missing some kernel config option(s) :) - I did fight
>> similar issues on a Fujitsu SandyBridge Xeon based server.
>>
>> Check if enabling CONFIG_X86_X2APIC helps as well as other APIC/IOMMU
>> options.
>
> Changing config options is not a valid fix for error messages like
> this.  We should be able to make the config smarter by adding
> dependencies or something, or else make the driver smart enough to
> give a more useful diagnostic.
>
> The "channel error register unreachable" message indicates that
> pci_read_config_dword() failed.  The register in question
> (IOAT_PCI_CHANERR_INT_OFFSET) is at 0x180, so possibly we don't have
> PCI config accessors for the extended config space (0x100-0xfff).  A
> complete dmesg log should show that.
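
To make the offset argument above concrete, here is a minimal Python sketch
(not kernel code) of the reachability rule. The only facts taken from the
thread are the 0x180 register offset and the 0x100-0xfff extended range; the
function and constant names are made up for illustration.

```python
# Conventional PCI config space is 256 bytes. PCIe extended config space
# runs up to 4096 bytes and needs an extended accessor (for example MMCONFIG).
CONVENTIONAL_CFG_SIZE = 0x100
EXTENDED_CFG_SIZE = 0x1000

# Offset quoted in the thread for IOAT_PCI_CHANERR_INT_OFFSET.
CHANERR_INT_OFFSET = 0x180


def dword_reachable(offset: int, cfg_size: int) -> bool:
    """Return True if a 4-byte read at `offset` fits inside `cfg_size` bytes."""
    return offset >= 0 and offset + 4 <= cfg_size


if __name__ == "__main__":
    for label, size in (("conventional only", CONVENTIONAL_CFG_SIZE),
                        ("extended", EXTENDED_CFG_SIZE)):
        ok = dword_reachable(CHANERR_INT_OFFSET, size)
        print(f"{label:18s} ({size:#06x} bytes): 0x180 reachable = {ok}")
```

With only the conventional accessors, a read at 0x180 falls outside the
256-byte window, which matches the "channel error register unreachable"
symptom. As far as I know, the size of a device's sysfs `config` file (256 or
4096 bytes) is one way to see which case applies on a running system.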

--

Here is the full dmesg: (I went back to my older kernel, let me know if you
need a dmesg w/ those options enabled)
http://home.comcast.net/~jpiszcz/20121126/dmesg.txt

Justin.



3.6.8 - CONFIG_IXGBE_HWMON -- how to poll Intel 10GbE temperature?

2012-11-27 Thread Justin Piszcz
Hello,

Which user-space application has support to read the temperature off of the
10GbE card?
Regular lm-sensors does not seem to be picking it up.

$ sensors|grep -e -
radeon-pci-0500
coretemp-isa-
nct6776-isa-0a30
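
One way to look past lm-sensors is to read the hwmon sysfs files directly. The
sketch below is only an assumption about where a sensor might appear: it walks
/sys/class/hwmon, and also the parent `device` directory, for `temp*_input`
files (values in millidegrees Celsius). Whether the ixgbe sensor shows up there
on 3.6.8 is not something this thread confirms.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return (chip_name, attribute, degrees_celsius) for each temp*_input file."""
    readings = []
    root_path = Path(root)
    if not root_path.is_dir():
        return readings
    for chip in sorted(root_path.glob("hwmon*")):
        # Some drivers attach attributes to the parent device, not to hwmonN.
        for base in (chip, chip / "device"):
            if not base.is_dir():
                continue
            temp_files = sorted(base.glob("temp*_input"))
            if not temp_files:
                continue
            name_file = base / "name"
            name = name_file.read_text().strip() if name_file.is_file() else chip.name
            for temp_file in temp_files:
                try:
                    millidegrees = int(temp_file.read_text().strip())
                except (OSError, ValueError):
                    continue  # sensor unreadable or not a plain integer
                readings.append((name, temp_file.name, millidegrees / 1000.0))
            break  # do not report the same chip twice
    return readings


if __name__ == "__main__":
    for name, attr, celsius in read_hwmon_temps():
        print(f"{name}: {attr} = {celsius:.1f} C")
```

If the card's sensor is registered but has no `name` attribute, that might be
one reason lm-sensors skips it, though that is a guess.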

Justin.



RE: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware bug question

2012-11-27 Thread Justin Piszcz


-Original Message-
From: Bjorn Helgaas [mailto:bhelg...@google.com] 
Sent: Monday, November 26, 2012 8:12 PM
To: Justin Piszcz
Cc: Bruno Prémont; supp...@supermicro.com; linux-kernel@vger.kernel.org; Dan
Williams
Subject: Re: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware
bug question

On Mon, Nov 26, 2012 at 6:00 PM, Justin Piszcz 
wrote:
>
>
> -Original Message-
> From: Bjorn Helgaas [mailto:bhelg...@google.com]
> Sent: Monday, November 26, 2012 8:00 PM
> To: Bruno Prémont
> Cc: Justin Piszcz; supp...@supermicro.com; linux-kernel@vger.kernel.org;
Dan
> Williams
> Subject: Re: Supermicro X9SRL-F - channel enumeration error &
ACPI/firmware
> bug question
>
> [Try Dan's current email address; sorry Dan]
>
> On Mon, Nov 26, 2012 at 5:56 PM, Bjorn Helgaas 
wrote:
>> [+cc Dan]
>>
>> On Mon, Nov 26, 2012 at 2:42 PM, Bruno Prémont
>>  wrote:
>>> Hi Justin,
>>>
>>> On Sat, 24 November 2012 "Justin Piszcz" wrote:
>>>> Is the following normal on an X9SRL-F board (bios 1.0a)?
>>>>
>>>> In the manual it states:
>>>>
>>>> Data Direct I/O
>>>> Select Enabled to enable Intel I/OAT (I/O Acceleration Technology),
> which
>>>> significantly reduces CPU overhead by leveraging CPU architectural
>>>> improvements and freeing the system resource for other tasks. The>
> Here is the full dmesg: (I went back to my older kernel, let me know if
you
> need a dmesg w/ those options enabled)
> http://home.comcast.net/~jpiszcz/20121126/dmesg.txt

It looks like maybe you don't have CONFIG_PCI_MMCONFIG turned on?

Hi,

I have two Supermicro boards I am trying this on. On the other system
(X8DTH-6F), with all of these options enabled, the system does not boot: it
cannot talk to the SATA boot drive.

" 5520 chips built in, the X8DTH-6/X8DTH-6F/X8DTH-i/X8DTH-iF offers ..
The Intel I/OAT (I/O Acceleration Technology) significantly reduces CPU
overhead by ..."

When the following options are enabled, the system does not boot:

+CONFIG_HAVE_INTEL_TXT=y
+CONFIG_IOMMU_API=y
+CONFIG_IOMMU_SUPPORT=y
+CONFIG_DMAR_TABLE=y
+CONFIG_INTEL_IOMMU=y
+CONFIG_INTEL_IOMMU_DEFAULT_ON=y
+CONFIG_INTEL_IOMMU_FLOPPY_WA=y

It fails like so:

(Fails to talk to the SSD)
http://home.comcast.net/~jpiszcz/20121127/photo1-resize.jpg

(then, a few moments later: Kernel panic)
http://home.comcast.net/~jpiszcz/20121127/photo2-resize.jpg

With those options disabled, the system boots (and always has booted fine).
Is there a certain combination of parameters that allows I/OAT to be enabled
_and_ allow the system to boot?

Justin.






RE: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware bug question

2012-11-27 Thread Justin Piszcz

> It looks like maybe you don't have CONFIG_PCI_MMCONFIG turned on?

===> FOR I/OAT DMA
Latest status: it _appears_ it's working on the X9SRL-F now, thank you!

1) Supermicro X9SRL-F (GOOD)
[0.738510] ioatdma: Intel(R) QuickData Technology Driver 4.00
[0.738719] ioatdma :00:04.0: irq 75 for MSI/MSI-X
[0.739088] ioatdma :00:04.1: irq 76 for MSI/MSI-X
[0.739408] ioatdma :00:04.2: irq 77 for MSI/MSI-X
[0.739739] ioatdma :00:04.3: irq 78 for MSI/MSI-X
[0.740040] ioatdma :00:04.4: irq 79 for MSI/MSI-X
[0.740342] ioatdma :00:04.5: irq 80 for MSI/MSI-X
[0.740670] ioatdma :00:04.6: irq 81 for MSI/MSI-X
[0.740971] ioatdma :00:04.7: irq 82 for MSI/MSI-X

It is _not_ working on the:

2) Supermicro X8DTH-6F (the boot drive in this system is running off a PCI-e
card; could the IRQ for the I/O controller be getting re-mapped and fail?).
Worst case, I can move the SSD from the 6.0 Gbps SATA card to the motherboard
and see if that works, but that kind of defeats the purpose of a 6.0 Gbps
SATA SSD.

(Fails to talk to the SSD)
http://home.comcast.net/~jpiszcz/20121127/photo1-resize.jpg

(then, a few moments later: Kernel panic)
http://home.comcast.net/~jpiszcz/20121127/photo2-resize.jpg

Would be curious if anyone had any suggestions besides removing the
controller card?

--


==> Further issues with the X9SRL-F -- does this board support ASPM or is
this a Linux/ASPM implementation issue?
[0.632170]  pci:ff: ACPI _OSC support notification failed, disabling
PCIe ASPM
[0.632239]  pci:ff: Unable to request _OSC control (_OSC support
mask: 0x08)

Justin.








RE: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware bug question

2012-11-27 Thread Justin Piszcz

> It is _not_ working on the:

> 2) Supermicro X8DTH-6F (the boot drive in this system is running off a
> PCI-e card; could the IRQ for the I/O controller be getting re-mapped and
> fail?). Worst case, I can move the SSD from the 6.0 Gbps SATA card to the
> motherboard and see if that works, but that kind of defeats the purpose of
> a 6.0 Gbps SATA SSD.

When IOMMU is disabled, I/OAT DMA is successful on the second motherboard
(X8DTH-6F).
Specifically:

--- DMA Engine support
[*]   Intel I/OAT DMA support
[*]   Network: TCP receive copy offload   
[*]   Async_tx: Offload support for the async_tx api

When IOMMU/X2APIC is enabled on the X8DTH-6F it fails to boot.
Will keep doing more testing to see if I get anywhere w/regards to the
IOMMU.

Proof of success:

[0.757467] ioatdma: Intel(R) QuickData Technology Driver 4.00
[0.757690] ioatdma :00:16.0: irq 88 for MSI/MSI-X
[0.757948] ioatdma :00:16.1: irq 89 for MSI/MSI-X
[0.758166] ioatdma :00:16.2: irq 90 for MSI/MSI-X
[0.758377] ioatdma :00:16.3: irq 91 for MSI/MSI-X
[0.758577] ioatdma :00:16.4: irq 92 for MSI/MSI-X
[0.758794] ioatdma :00:16.5: irq 93 for MSI/MSI-X
[0.759000] ioatdma :00:16.6: irq 94 for MSI/MSI-X
[0.759214] ioatdma :00:16.7: irq 95 for MSI/MSI-X
[0.759461] ioatdma :80:16.0: irq 96 for MSI/MSI-X
[0.759720] ioatdma :80:16.1: irq 97 for MSI/MSI-X
[0.759963] ioatdma :80:16.2: irq 98 for MSI/MSI-X
[0.760190] ioatdma :80:16.3: irq 99 for MSI/MSI-X
[0.760414] ioatdma :80:16.4: irq 100 for MSI/MSI-X
[0.760630] ioatdma :80:16.5: irq 101 for MSI/MSI-X
[0.760862] ioatdma :80:16.6: irq 102 for MSI/MSI-X
[0.761081] ioatdma :80:16.7: irq 103 for MSI/MSI-X

--


==> Further issues with the X9SRL-F -- does this board support ASPM or is
this a Linux/ASPM implementation issue?
[0.632170]  pci:ff: ACPI _OSC support notification failed, disabling
PCIe ASPM
[0.632239]  pci:ff: Unable to request _OSC control (_OSC support
mask: 0x08)

Justin.









RE: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware bug question

2012-11-27 Thread Justin Piszcz


-Original Message-
From: Justin Piszcz [mailto:jpis...@lucidpixels.com] 
Sent: Tuesday, November 27, 2012 8:56 AM
To: 'Bjorn Helgaas'
Cc: 'Bruno Prémont'; supp...@supermicro.com; linux-kernel@vger.kernel.org;
'Dan Williams'
Subject: RE: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware
bug question


> It is _not_ working on the:

> 2) Supermicro X8DTH-6F (the boot drive in this system is running off a
> PCI-e card; could the IRQ for the I/O controller be getting re-mapped and
> fail?). Worst case, I can move the SSD from the 6.0 Gbps SATA card to the
> motherboard and see if that works, but that kind of defeats the purpose of
> a 6.0 Gbps SATA SSD.

When I removed the Highpoint 2-port SATA card and plugged the SSD into the
motherboard, the system booted.
So if you use a HIGHPOINT 2-PORT SATA 6.0 Gbps card, do NOT enable IOMMU or
it will fail to initialize the Highpoint 2-port SATA controller card!
I also tried upgrading the motherboard BIOS (no difference).
I also tried leaving the SATA card installed and plugging the SSD into the
motherboard (no difference).
Only removing the Highpoint 2-port SATA card brought success. It would be
nice to use that card with IOMMU support, though: is it just not compatible
(Marvell problem?) or is it a driver bug, based on the pictures etc. sent
earlier?

$ dmesg|grep -i iommu
[0.055134] dmar: IOMMU 0: reg_base_addr cfdfe000 ver 1:0 cap
c90780106f0462 ecap f020f6
[0.055396] dmar: IOMMU 1: reg_base_addr fecfe000 ver 1:0 cap
c90780106f0462 ecap f020f6
[0.760665] IOMMU 0 0xcfdfe000: using Queued invalidation
[0.760803] IOMMU 1 0xfecfe000: using Queued invalidation
[0.760937] IOMMU: Setting RMRR:
[0.761102] IOMMU: Setting identity map for device :00:1d.0
[0xbf7ec000 - 0xbf7f]
[0.761329] IOMMU: Setting identity map for device :00:1d.1
[0xbf7ec000 - 0xbf7f]
[0.761542] IOMMU: Setting identity map for device :00:1d.2
[0xbf7ec000 - 0xbf7f]
[0.761758] IOMMU: Setting identity map for device :00:1d.7
[0xbf7ec000 - 0xbf7f]
[0.761974] IOMMU: Setting identity map for device :00:1a.0
[0xbf7ec000 - 0xbf7f]
[0.762190] IOMMU: Setting identity map for device :00:1a.1
[0xbf7ec000 - 0xbf7f]
[0.762407] IOMMU: Setting identity map for device :00:1a.2
[0xbf7ec000 - 0xbf7f]
[0.762620] IOMMU: Setting identity map for device :00:1a.7
[0xbf7ec000 - 0xbf7f]
[0.762816] IOMMU: Setting identity map for device :00:1d.0 [0xec000
- 0xe]
[0.763010] IOMMU: Setting identity map for device :00:1d.1 [0xec000
- 0xe]
[0.763197] IOMMU: Setting identity map for device :00:1d.2 [0xec000
- 0xe]
[0.763382] IOMMU: Setting identity map for device :00:1d.7 [0xec000
- 0xe]
[0.763567] IOMMU: Setting identity map for device :00:1a.0 [0xec000
- 0xe]
[0.763749] IOMMU: Setting identity map for device :00:1a.1 [0xec000
- 0xe]
[0.763934] IOMMU: Setting identity map for device :00:1a.2 [0xec000
- 0xe]
[0.764127] IOMMU: Setting identity map for device :00:1a.7 [0xec000
- 0xe]
[0.764311] IOMMU: Prepare 0-16MiB unity mapping for LPC
[0.764465] IOMMU: Setting identity map for device :00:1f.0 [0x0 -
0xff]

--


==> Further issues with the X9SRL-F -- does this board support ASPM or is
this a Linux/ASPM implementation issue?
[0.632170]  pci:ff: ACPI _OSC support notification failed, disabling
PCIe ASPM
[0.632239]  pci:ff: Unable to request _OSC control (_OSC support
mask: 0x08)

Justin.










3.6.8: dmar: DRHD: handling fault status reg 602

2012-11-27 Thread Justin Piszcz
Hello,

Any idea why this is happening (e.g. why is PTE Read Access not set?)

[   13.204560] dmar: DRHD: handling fault status reg 602
[   13.208078] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0
[   13.208078] DMAR:[fault reason 06] PTE Read access is not set
[   15.777874] dmar: DRHD: handling fault status reg 702
[   15.777879] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0
[   15.777879] DMAR:[fault reason 06] PTE Read access is not set
[   16.100453] dmar: DRHD: handling fault status reg 2
[   16.100458] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0
[   16.100458] DMAR:[fault reason 06] PTE Read access is not set
[   16.141058] dmar: DRHD: handling fault status reg 102
[   16.141062] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0
[   16.141062] DMAR:[fault reason 06] PTE Read access is not set
[   16.210102] dmar: DRHD: handling fault status reg 202
[   16.210111] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0
[   16.210111] DMAR:[fault reason 06] PTE Read access is not set
[   16.918149] ixgbe :86:00.0: eth2: NIC Link is Up 10 Gbps, Flow
Control: RX/TX
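
For anyone sifting through a longer log, here is a small Python sketch that
pairs each "Request device" line with the "fault reason" line that follows it
and counts faults per device. The regular expressions are written against the
lines quoted above only; other kernel versions may format these messages
differently, and the function name is made up.

```python
import re
from collections import Counter

# Patterns follow the dmesg lines quoted in this message.
REQUEST_RE = re.compile(
    r"DMAR:\[(?P<kind>DMA (?:Read|Write))\] Request device "
    r"\[(?P<dev>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\] fault addr (?P<addr>[0-9a-f]+)"
)
REASON_RE = re.compile(r"DMAR:\[fault reason (?P<code>\d+)\] (?P<text>.+)")

SAMPLE = """\
[   13.204560] dmar: DRHD: handling fault status reg 602
[   13.208078] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0
[   13.208078] DMAR:[fault reason 06] PTE Read access is not set
[   15.777874] dmar: DRHD: handling fault status reg 702
[   15.777879] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0
[   15.777879] DMAR:[fault reason 06] PTE Read access is not set
"""


def parse_dmar_faults(log: str):
    """Pair each 'Request device' line with the 'fault reason' line after it."""
    faults = []
    pending = None
    for line in log.splitlines():
        m = REQUEST_RE.search(line)
        if m:
            pending = {"kind": m["kind"], "dev": m["dev"],
                       "addr": int(m["addr"], 16)}
            continue
        m = REASON_RE.search(line)
        if m and pending is not None:
            pending["reason"] = int(m["code"])
            pending["text"] = m["text"].strip()
            faults.append(pending)
            pending = None
    return faults


if __name__ == "__main__":
    counts = Counter((f["dev"], f["reason"]) for f in parse_dmar_faults(SAMPLE))
    for (dev, reason), n in counts.items():
        print(f"device {dev}: {n} fault(s) with reason {reason:02d}")
```

Grouping by requesting device makes it easier to see that every fault here
comes from the same function, 04:00.0, with the same reason code.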

This is from:
http://lkml.org/lkml/2012/11/27/263

Justin.



RE: 3.6.8: dmar: DRHD: handling fault status reg 602

2012-11-27 Thread Justin Piszcz

-Original Message-
From: Justin Piszcz [mailto:jpis...@lucidpixels.com] 
Sent: Tuesday, November 27, 2012 10:16 AM
To: linux-kernel@vger.kernel.org
Subject: 3.6.8: dmar: DRHD: handling fault status reg 602

Hello,

Any idea why this is happening (e.g. why is PTE Read Access not set?)

[   13.204560] dmar: DRHD: handling fault status reg 602
[   13.208078] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0
[   13.208078] DMAR:[fault reason 06] PTE Read access is not set
[   15.777874] dmar: DRHD: handling fault status reg 702
[   15.777879] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0
[   15.777879] DMAR:[fault reason 06] PTE Read access is not set
[   16.100453] dmar: DRHD: handling fault status reg 2
[   16.100458] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0
[   16.100458] DMAR:[fault reason 06] PTE Read access is not set
[   16.141058] dmar: DRHD: handling fault status reg 102
[   16.141062] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0
[   16.141062] DMAR:[fault reason 06] PTE Read access is not set
[   16.210102] dmar: DRHD: handling fault status reg 202
[   16.210111] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0
[   16.210111] DMAR:[fault reason 06] PTE Read access is not set
[   16.918149] ixgbe :86:00.0: eth2: NIC Link is Up 10 Gbps, Flow
Control: RX/TX

This is from:
http://lkml.org/lkml/2012/11/27/263

Justin.

--

Hi,

Disregard, appears to be a nouveau bug:
https://bugzilla.redhat.com/show_bug.cgi?id=573173

Justin.



RE: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware bug question

2012-11-28 Thread Justin Piszcz


-Original Message-
From: Bjorn Helgaas [mailto:bhelg...@google.com] 
Sent: Wednesday, November 28, 2012 6:54 PM
To: Justin Piszcz
Cc: Bruno Prémont; supp...@supermicro.com; linux-kernel@vger.kernel.org; Dan
Williams
Subject: Re: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware
bug question

On Tue, Nov 27, 2012 at 6:49 AM, Justin Piszcz 
wrote:
>
>> It looks like maybe you don't have CONFIG_PCI_MMCONFIG turned on?
>
> ===> FOR I/OAT DMA
> Latest status: it _appears_ it's working on the X9SRL-F now, thank you!
>
> 1) Supermicro X9SRL-F (GOOD)
> [0.738510] ioatdma: Intel(R) QuickData Technology Driver 4.00
> [0.738719] ioatdma :00:04.0: irq 75 for MSI/MSI-X
> [0.739088] ioatdma :00:04.1: irq 76 for MSI/MSI-X
> [0.739408] ioatdma :00:04.2: irq 77 for MSI/MSI-X
> [0.739739] ioatdma :00:04.3: irq 78 for MSI/MSI-X
> [0.740040] ioatdma :00:04.4: irq 79 for MSI/MSI-X
> [0.740342] ioatdma :00:04.5: irq 80 for MSI/MSI-X
> [0.740670] ioatdma :00:04.6: irq 81 for MSI/MSI-X
> [0.740971] ioatdma :00:04.7: irq 82 for MSI/MSI-X

Good.  You have two issues, and I'm going to separate them and only
address the first one here.  I opened a bug report [1] against the
IOAT driver.  It should do something more useful when
CONFIG_PCI_MMCONFIG=n so we don't have to debug this again in the
future.  But otherwise, it sounds like this issue is resolved.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=51101
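
Until the driver reports this itself, a user-space check of the kernel config
can catch the CONFIG_PCI_MMCONFIG=n case early. The Python sketch below is only
an illustration: the helper names and the sample config text are made up, and
where the config lives (for example /boot/config-<version> or /proc/config.gz)
depends on the distribution and kernel build.

```python
SUFFIX = " is not set"

# Hypothetical excerpt of a kernel .config, for demonstration only.
SAMPLE = """\
CONFIG_PCI=y
# CONFIG_PCI_MMCONFIG is not set
CONFIG_INTEL_IOATDMA=y
"""


def parse_kconfig(text: str) -> dict:
    """Map option name -> value; options marked 'is not set' map to 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(SUFFIX):
            options[line[2:-len(SUFFIX)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def missing_options(text: str, wanted) -> list:
    """Return the wanted options that are absent or disabled."""
    options = parse_kconfig(text)
    return [name for name in wanted if options.get(name, "n") == "n"]


if __name__ == "__main__":
    wanted = ["CONFIG_PCI_MMCONFIG", "CONFIG_INTEL_IOATDMA"]
    for name in missing_options(SAMPLE, wanted):
        print(f"{name} is not enabled")
```

Treating an absent option the same as a disabled one keeps the check simple,
at the cost of not telling the two cases apart.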

--

Yes--(agree w/ config option) Thank you!

Justin.



RE: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware bug question

2012-11-28 Thread Justin Piszcz


-Original Message-
From: Bjorn Helgaas [mailto:bhelg...@google.com] 
Sent: Wednesday, November 28, 2012 7:09 PM
To: Justin Piszcz
Cc: Bruno Prémont; supp...@supermicro.com; linux-kernel@vger.kernel.org; Dan
Williams
Subject: Re: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware
bug question

On Tue, Nov 27, 2012 at 7:35 AM, Justin Piszcz 
wrote:
>
>
> -Original Message-
> From: Justin Piszcz [mailto:jpis...@lucidpixels.com]
> Sent: Tuesday, November 27, 2012 8:56 AM
> To: 'Bjorn Helgaas'
> Cc: 'Bruno Prémont'; supp...@supermicro.com; linux-kernel@vger.kernel.org;
> 'Dan Williams'
> Subject: RE: Supermicro X9SRL-F - channel enumeration error &
ACPI/firmware
> bug question
>
>
>> It is _not_ working on the:
>
>> 2) Supermicro X8DTH-6F (the boot drive in this system is running off a
>> PCI-e card; could the IRQ for the I/O controller be getting re-mapped and
>> fail?). Worst case, I can move the SSD from the 6.0 Gbps SATA card to the
>> motherboard and see if that works, but that kind of defeats the purpose of
>> a 6.0 Gbps SATA SSD.
>
> When I removed the Highpoint 2-port SATA card and plugged it into the
> motherboard, the system boots (plugged the SSD into the motherboard).
> So if you use a HIGHPOINT 2-PORT SATA 6.0gbps card, do NOT enable IOMMU or
> it will fail to initialize the Highpoint 2-port SATA controller card!
> I also tried upgrading the BIOS (of the mobo, no diff)
> I also tried just leaving the SATA card in and plugging it into the
> motherboard (no diff)
> Removed the Highpoint 2-port SATA card and then success, it would be nice
to
> use that card with IOMMU support though, is it just not compatible
> (marvell-problem?) or is a driver bug?  Based on the pictures/etc sent
> earlier?

I would guess this is a core bug, but it's hard to tell without more
information.

If you boot with "intel_iommu=off", I would guess the Highpoint card
would work (this should have the same effect as turning off
CONFIG_INTEL_IOMMU).  I'd like to compare the complete dmesg log for
that boot with the one that fails.

It sounds like it might be hard to collect the log for the failing
case -- you said the boot fails when the Highpoint card is in the
system even if the SSD is connected to the motherboard instead of the
Highpoint card.  The panic in the photo2 image looks like it's just a
failure to mount the root filesystem, which is what I'd expect if we
can't find the SSD.  It seems like we ought to be able to *boot* with
the SSD connected to the motherboard, even if the Highpoint card
doesn't work.  But worst-case, a video of the failing boot might be
enough, especially if you can slow it down with "boot_delay="
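
When comparing the two boots, it helps to confirm which parameters each one
actually ran with. A rough Python sketch follows, assuming the command line is
read from /proc/cmdline; the helper names and the sample string (including the
root device) are hypothetical.

```python
import shlex


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to True."""
    params = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def describe_boot(cmdline: str) -> str:
    """Summarise the two parameters discussed in this message."""
    params = parse_cmdline(cmdline)
    iommu = params.get("intel_iommu", "(not given)")
    delay = params.get("boot_delay", "(not given)")
    return f"intel_iommu={iommu} boot_delay={delay}"


if __name__ == "__main__":
    # Hypothetical command line; on a live system read /proc/cmdline instead.
    sample = "root=/dev/sda2 ro intel_iommu=off boot_delay=100"
    print(describe_boot(sample))
```

Saving this summary next to each dmesg log makes it harder to mix up the
working and failing boots later.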

--

SUMMARY: The card fails with IOMMU support in the kernel, but the system does
now boot (3.6.8) with the card installed, as long as the system disk isn't
attached to it. I am not sure what was wrong earlier.

It seems to be working now with:
=> SSD on motherboard
=> PCI-e card (Highpoint) in the system but not used, no disks attached

(This is after I enabled nouveau; I am not sure that has anything to do with
it.) I put the card in, and it errors as usual, but with the SSD now on the
motherboard the system boots successfully.

Here are the errors from the kernel trying to initialize the board with
IOMMU enabled (retrieved via netconsole); a picture is also linked below
(taken with help from boot_delay=100 and nouveau enabled):
http://home.comcast.net/~jpiszcz/20121128/highpoint.jpg

Nov 28 19:30:16 p34 [7.771060] ata14.00: qc timeout (cmd 0xa1) 
Nov 28 19:30:16 p34 [8.270153] ata14.00: failed to IDENTIFY (I/O error,
err_mask=0x4) 
Nov 28 19:30:17 p34 [9.073935] ata14: SATA link up 1.5 Gbps (SStatus 113
SControl 300) 
Nov 28 19:30:27 p34 [   19.058915] ata14.00: qc timeout (cmd 0xa1) 
Nov 28 19:30:28 p34 [   19.557885] ata14.00: failed to IDENTIFY (I/O error,
err_mask=0x4) 
Nov 28 19:30:28 p34 [   19.558478] ata14: limiting SATA link speed to 1.5
Gbps 
Nov 28 19:30:29 p34 [   20.363658] ata14: SATA link up 1.5 Gbps (SStatus 113
SControl 310) 
Nov 28 19:30:48 p34 [   39.568234] dmar: DRHD: handling fault status reg 502
Nov 28 19:30:48 p34 [   39.571508] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0
[   39.571508] DMAR:[fault reason 06] PTE Read access is not set
Nov 28 19:30:59 p34 [   50.318146] ata14.00: qc timeout (cmd 0xa1) 
Nov 28 19:30:59 p34 [   50.818061] ata14.00: failed to IDENTIFY (I/O error,
err_mask=0x4) 
Nov 28 19:31:00 p34 [   51.621827] ata14: SATA link up 1.5 Gbps (SStatus 113
SControl 310)
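
The SStatus/SControl values in the log above can be split into their three
4-bit fields (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8). The Python
sketch below does that, assuming the kernel prints both registers in
hexadecimal; the function names are made up for illustration.

```python
import re

# SPD field values; 0 means no negotiated speed (SStatus) or no limit (SControl).
SPD_NAMES = {0: "none/unrestricted", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}

LINK_RE = re.compile(
    r"SStatus (?P<sstatus>[0-9A-Fa-f]+) SControl (?P<scontrol>[0-9A-Fa-f]+)"
)


def decode_fields(value: int) -> dict:
    """Split a SATA status/control register value into DET, SPD and IPM."""
    return {"det": value & 0xF, "spd": (value >> 4) & 0xF, "ipm": (value >> 8) & 0xF}


def decode_link_line(line: str):
    """Decode one 'SATA link up' line; return None if the line does not match."""
    m = LINK_RE.search(line)
    if m is None:
        return None
    return {
        "sstatus": decode_fields(int(m["sstatus"], 16)),
        "scontrol": decode_fields(int(m["scontrol"], 16)),
    }


if __name__ == "__main__":
    for line in (
        "ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)",
        "ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)",
    ):
        decoded = decode_link_line(line)
        limit = SPD_NAMES.get(decoded["scontrol"]["spd"], "unknown")
        print(f"SControl speed limit: {limit}")
```

Read this way, the change from SControl 300 to 310 is the SPD limit going from
unrestricted to 1.5 Gbps, which lines up with the "limiting SATA link speed to
1.5 Gbps" message in between.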

Justin.




RE: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware bug question

2012-11-28 Thread Justin Piszcz


-Original Message-
From: Robert Hancock [mailto:hancock...@gmail.com] 
Sent: Wednesday, November 28, 2012 7:35 PM
To: Justin Piszcz
Cc: 'Bjorn Helgaas'; 'Bruno Prémont'; supp...@supermicro.com;
linux-kernel@vger.kernel.org; 'Dan Williams'
Subject: Re: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware
bug question


What does lspci -vv show on that controller? Not sure what actual 
chipset that controller is, but there's a known issue with some Marvell 
6Gbps SATA controllers with DMAR enabled - it seems the device issues 
memory read/write requests from the wrong PCI function ID and the IOMMU 
rightly denies access as the function listed in the requests doesn't 
have any mapping to that memory. I don't think there's presently a 
workaround other than disabling DMAR. We could (and likely should) be 
detecting that device and adding some kind of quirk for it.
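
As a toy model of that failure mode (not how the kernel implements DMAR), the
Python sketch below keys mappings by PCIe requester ID, i.e. bus in bits 15:8,
device in bits 7:3 and function in bits 2:0. The class, the page number and the
choice of function 1 as the "wrong" function are all hypothetical; which
function the Marvell parts actually use is not stated here.

```python
def parse_bdf(text: str) -> int:
    """Convert 'bb:dd.f' (hex) into a 16-bit PCIe requester ID."""
    bus, rest = text.split(":")
    dev, fn = rest.split(".")
    return (int(bus, 16) << 8) | (int(dev, 16) << 3) | int(fn, 16)


class ToyIommu:
    """Maps requester ID -> set of page numbers that requester may access."""

    def __init__(self):
        self.tables = {}

    def map_page(self, bdf: str, page: int) -> None:
        self.tables.setdefault(parse_bdf(bdf), set()).add(page)

    def allows(self, bdf: str, page: int) -> bool:
        # A requester with no table at all is denied, just like an unmapped page.
        return page in self.tables.get(parse_bdf(bdf), set())


if __name__ == "__main__":
    iommu = ToyIommu()
    iommu.map_page("84:00.0", 0x1234)  # driver maps a buffer for function 0

    # Request tagged with the function the mapping was made for.
    print("84:00.0 ->", iommu.allows("84:00.0", 0x1234))
    # Same device, but the request carries a different function number.
    print("84:00.1 ->", iommu.allows("84:00.1", 0x1234))
```

The second lookup fails even though the page is mapped, because the table is
selected by the requester ID carried in the request rather than by the
physical device.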

That sounds likely...
It is shown below:

Card name: HighPoint Rocket 620 Dual Port SATA 6 Gbps PCI Express 2.0 Host
Adapter

lspci -vv output:

84:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9123 PCIe SATA
6.0 Gb/s controller (rev 11) (prog-if 01 [AHCI 1.0])
  Subsystem: Marvell Technology Group Ltd. 88SE9123 PCIe SATA 6.0 Gb/s
controller
  Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- 
> --
>
>
> ==> Further issues with the X9SRL-F -- does this board support ASPM or is
> this a Linux/ASPM implementation issue?
> [0.632170]  pci:ff: ACPI _OSC support notification failed,
disabling
> PCIe ASPM
> [0.632239]  pci:ff: Unable to request _OSC control (_OSC support
> mask: 0x08)

What's the full dmesg from this machine (or is it already posted somewhere)?

It is now available here:
http://home.comcast.net/~jpiszcz/20121128/dmesg.txt

Justin.




RE: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware bug question

2012-11-29 Thread Justin Piszcz


-Original Message-
From: Robert Hancock [mailto:hancock...@gmail.com] 
Sent: Wednesday, November 28, 2012 7:55 PM
To: Justin Piszcz
Cc: Bjorn Helgaas; Bruno Prémont; supp...@supermicro.com;
linux-kernel@vger.kernel.org; Dan Williams
Subject: Re: Supermicro X9SRL-F - channel enumeration error & ACPI/firmware
bug question

On Wed, Nov 28, 2012 at 6:49 PM, Justin Piszcz 
wrote:
>
>
> -Original Message-
> From: Robert Hancock [mailto:hancock...@gmail.com]
> Sent: Wednesday, November 28, 2012 7:35 PM
> To: Justin Piszcz
> Cc: 'Bjorn Helgaas'; 'Bruno Prémont'; supp...@supermicro.com;
> linux-kernel@vger.kernel.org; 'Dan Williams'
> Subject: Re: Supermicro X9SRL-F - channel enumeration error &
ACPI/firmware
> bug question
>
>
> What does lspci -vv show on that controller? Not sure what actual
> chipset that controller is, but there's a known issue with some Marvell
> 6Gbps SATA controllers with DMAR enabled - it seems the device issues
> memory read/write requests from the wrong PCI function ID and the IOMMU
> rightly denies access as the function listed in the requests doesn't
> have any mapping to that memory. I don't think there's presently a
> workaround other than disabling DMAR. We could (and likely should) be
> detecting that device and adding some kind of quirk for it.
>
> That sounds likely...
> It is shown below:
>
> Card name: HighPoint Rocket 620 Dual Port SATA 6 Gbps PCI Express 2.0 Host
> Adapter
>
> lspci -vv output:
>
> 84:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9123 PCIe SATA
> 6.0 Gb/s controller (rev 11) (prog-if 01 [AHCI 1.0])
>   Subsystem: Marvell Technology Group Ltd. 88SE9123 PCIe SATA 6.0 Gb/s
> controller

Yeah, that's one of those controllers I think. But I can't tell from
the bit of the dmesg you posted exactly what's going on. Can you post
a full boot log from having the card installed and some drive attached
(by putting the boot drive on another controller for example)?

>> ==> Further issues with the X9SRL-F -- does this board support ASPM or is
>> this a Linux/ASPM implementation issue?
>> [0.632170]  pci:ff: ACPI _OSC support notification failed,
> disabling
>> PCIe ASPM
>> [0.632239]  pci:ff: Unable to request _OSC control (_OSC support
>> mask: 0x08)
>
> What's the full dmesg from this machine (or is it already posted
somewhere)?
>
> It is now available here:
> http://home.comcast.net/~jpiszcz/20121128/dmesg.txt

> Is that the same boot log? It doesn't have this error in it.

Yes, the error is here: (its towards the bottom)

 [7.973015] ata14.00: qc timeout (cmd 0xa1)
[8.472120] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[9.275922] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[   19.260667] ata14.00: qc timeout (cmd 0xa1)
[   19.759828] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   19.760451] ata14: limiting SATA link speed to 1.5 Gbps
[   20.566598] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[   50.521078] ata14.00: qc timeout (cmd 0xa1)
[   51.020880] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   51.824664] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[   51.824682] dmar: DRHD: handling fault status reg 502
[   51.824686] dmar: DMAR:[DMA Read] Request device [04:00.0] fault addr 0 
[   51.824686] DMAR:[fault reason 06] PTE Read access is not set
[   52.338871] EXT3-fs (sdb2): error: couldn't mount because of unsupported
optional features (240)
[   52.348938] EXT2-fs (sdb2): error: couldn't mount because of unsupported
optional features (240)
[   52.360314] EXT4-fs (sdb2): mounted filesystem with ordered data mode.
Opts: (null)

The system does not boot when the SSD is on that SATA controller.
I can no longer reproduce the error we were trying to capture earlier (the
kernel panic) after adding nouveau, for whatever reason.
So to recap: it now boots with nothing connected to the controller, but the
controller is unusable, as shown above.
When you put the SSD on it, it cannot mount the rootfs.
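
For anyone sifting through a long boot log for this kind of fault, here is a
minimal Python sketch that pulls the requesting device and the fault reason
out of DMAR lines shaped like the ones above. The function name and the
regular expressions are my own assumptions, fitted only to the two lines
quoted here; other kernel versions may word these messages differently.

```python
import re

# Patterns fitted to the two DMAR lines quoted above (assumption: other
# kernel versions may format these messages differently).
REQUEST_RE = re.compile(
    r"DMAR:\[DMA (Read|Write)\] Request device \[([0-9a-fA-F:.]+)\] fault addr (\S+)"
)
REASON_RE = re.compile(r"DMAR:\[fault reason (\d+)\] (.+)")


def parse_dmar_faults(lines):
    """Return a list of dicts, one per DMAR fault found in the log lines."""
    faults = []
    for line in lines:
        request = REQUEST_RE.search(line)
        if request:
            faults.append({
                "access": request.group(1),
                "device": request.group(2),
                "addr": request.group(3),
                "reason_code": None,
                "reason": None,
            })
            continue
        reason = REASON_RE.search(line)
        # Attach the reason line to the most recent request line, if any.
        if reason and faults and faults[-1]["reason_code"] is None:
            faults[-1]["reason_code"] = int(reason.group(1))
            faults[-1]["reason"] = reason.group(2).strip()
    return faults
```

Feeding it the lines of a saved dmesg gives one entry per fault, which makes
it easy to compare the device in the fault against the lspci output.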

Justin.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


kernel 3.4.0 (i686): pagevec_lookup_tag+0x21/0x30

2012-09-14 Thread Justin Piszcz
Hi,

Was curious why this occurs with postfix's cleanup every so often?

[562320.275125] INFO: task cleanup:5921 blocked for more than 120 seconds.
[562320.275127] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[562320.275129] cleanup D 000d 0  5921   2437 0x
[562320.275132]  f2986080 0086 ca379ed4 000d 0001 c16ca000
c1078521 000e
[562320.275137]  ca379edc  c106f3c6 0001 000e 0001
ce0a965c 
[562320.275141]  f5a84070 0003 0001 f5a8406c 0292 
f5a84014 c1043c5e
[562320.275146] Call Trace:
[562320.275148]  [] ? pagevec_lookup_tag+0x21/0x30
[562320.275151]  [] ? filemap_fdatawait_range+0x86/0x140
[562320.275154]  [] ? __wake_up+0x3e/0x60
[562320.275156]  [] ? prepare_to_wait+0x1d/0x70
[562320.275159]  [] ? jbd2_log_wait_commit+0x95/0x100
[562320.275162]  [] ? abort_exclusive_wait+0x90/0x90
[562320.275164]  [] ? ext4_sync_file+0x13c/0x2c0
[562320.275167]  [] ? chmod_common+0x74/0x90
[562320.275169]  [] ? vfs_write+0x106/0x140
[562320.275172]  [] ? ext4_flush_completed_IO+0x90/0x90
[562320.275175]  [] ? vfs_fsync+0x2b/0x40
[562320.275177]  [] ? sys_fsync+0x20/0x40
[562320.275180]  [] ? sysenter_do_call+0x12/0x26

Justin.



3.6.0 kernel - ext4 corruption, or?

2012-10-24 Thread Justin Piszcz
Hi,

I read there was a bug in 3.6.2, is there also one in 3.6.0, or can someone
help explain this?
I did boot systemrescuecd 3.0.0 and ran fsck.ext4 -f partition and there
were no errors reported.
Seems the inode for the directory is missing?

# grep 10.0.0.11 -r /etc
/etc/posfix.old/master.cf:10.0.0.11:smtp inet  n   -   -
-1postscreen

# ls -l /etc/posfix.old/master.cf
-rw-r--r-- 1 root root 13236 Oct 28  2011 /etc/posfix.old/master.cf

# ls -ld /etc/postfix.old
ls: cannot access /etc/postfix.old: No such file or directory

Justin.



RE: 3.6.0 kernel - ext4 corruption, or?

2012-10-24 Thread Justin Piszcz


-Original Message-
From: ty...@mit.edu [mailto:ty...@mit.edu] 
Sent: Wednesday, October 24, 2012 10:40 AM
To: Justin Piszcz; Steven J. Magnani
Cc: linux-kernel@vger.kernel.org
Subject: Re: 3.6.0 kernel - ext4 corruption, or?

On Wed, Oct 24, 2012 at 08:15:37AM -0400, Justin Piszcz wrote:
> 
> I read there was a bug in 3.6.2, is there also one in 3.6.0, or can someone
> help explain this?

The problem which we are currently trying to investigate was
reportedly introduced in v3.6.1.  So far that's about all we know; we
have two users who have reported it, but I and other ext4 developers
haven't been able to reproduce it yet.

Got it.

> # grep 10.0.0.11 -r /etc
> /etc/posfix.old/master.cf:10.0.0.11:smtp inet  n   -   -
> -1postscreen
> 
> # ls -l /etc/posfix.old/master.cf
> -rw-r--r-- 1 root root 13236 Oct 28  2011 /etc/posfix.old/master.cf
> 
> # ls -ld /etc/postfix.old
> ls: cannot access /etc/postfix.old: No such file or directory
> 

Looks like you or some script renamed /etc/postfix to /etc/posfix.old
as part of some upgrade?

Whoops, sorry about that, it was a typo on my end. So for now (concerning the
other bug) it's best to stay with 3.6.0 until the bug is found. Thank you.

Justin.



3.6.0 ext4 dump/filemap_fault?

2012-10-28 Thread Justin Piszcz
Hello,

Any idea what happened here (during a backup)?
Partition is ext4.

[116868.118797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[116868.118798] dumpD 88003d32a5c0 0 21219  21214
0x
[116868.118801]  880584631d28 0082 880584631d48
880584631fd8
[116868.118803]  880584631fd8 4000 81a14420
88003d32a5c0
[116868.118804]  8807d41449c0 88081bcea828 880584631cb8
810b88a7
[116868.118806] Call Trace:
[116868.118812]  [] ? filemap_fault+0x87/0x430
[116868.118814]  [] ? __dequeue_entity+0x2a/0x50
[116868.118817]  [] schedule+0x24/0x70
[116868.118820]  [] schedule_timeout+0x17c/0x1d0
[116868.118822]  [] ? check_preempt_curr+0x75/0xa0
[116868.118823]  [] ? ttwu_do_wakeup+0x12/0x90
[116868.118824]  [] ?
ttwu_do_activate.constprop.60+0x61/0x70
[116868.118826]  [] wait_for_common+0xc2/0x150
[116868.118827]  [] ? try_to_wake_up+0x2d0/0x2d0
[116868.118829]  [] ? fdatawrite_one_bdev+0x20/0x20
[116868.118830]  [] wait_for_completion+0x18/0x20
[116868.118832]  [] sync_inodes_sb+0x9e/0x1b0
[116868.118834]  [] ? fdatawrite_one_bdev+0x20/0x20
[116868.118835]  [] sync_inodes_one_sb+0x19/0x20
[116868.118837]  [] iterate_supers+0xe1/0xf0
[116868.118838]  [] sys_sync+0x30/0x90
[116868.118839]  [] system_call_fastpath+0x1a/0x1f

Justin.



Re: 3.6.0 ext4 dump/filemap_fault?

2012-10-29 Thread Justin Piszcz
On Sun, Oct 28, 2012 at 6:51 PM, Theodore Ts'o  wrote:
> On Sun, Oct 28, 2012 at 05:02:17PM -0400, Justin Piszcz wrote:
>> Hello,
>>
>> Any idea what happened here (during a backup)?
>
> A sync system call took longer than two minutes.  Why that happened,
> it's harder to say.  It's a warning, though, and not a fatal panic or
> kernel oops.
Ah, got it.

>
> How much memory do you have in your system?
32GB memory/32GB swap

> What happened afterwards?
It did eventually complete.

> Did the system continue, and did the sync command (I presume you ran
> "sync" from the command line?)  finally return to the command prompt?
In this case I did not run sync, I waited for the processes/dump/etc
to complete.

Thanks.

Justin.


Re: [lm-sensors] 3.10: NCT6776F sensor question with Supermicro X9SRL-F motherboard

2013-07-11 Thread Justin Piszcz
Thanks for these details!

On Thu, Jul 11, 2013 at 4:47 AM, Jean Delvare  wrote:
> Hi Guenter, Justin,
>
> On Wed, 3 Jul 2013 07:42:01 -0700, Guenter Roeck wrote:
>> On Wed, Jul 03, 2013 at 08:35:59AM -0400, Justin Piszcz wrote:
>> > I also found:
>> > http://www.lm-sensors.org/wiki/Configurations/SuperMicro/X9SRA
>> >
>> > Does Super Micro also have such a config file for the X9SRL-F?
>> > This board uses a NCT6776F.
>>
>> Supermicro does not provide configuration files. You can take the above file,
>> test and update it, and let us know so we can add it to the wiki.
>
> Actually they do provide configuration information. They have their own
> tool named SuperDoctor, which can be downloaded from:
>   ftp://ftp.supermicro.com/utility/SuperDoctor_II/Linux/Release/
>
> If you look at file AllSuperD.ini, you'll find per-board entries
> describing the input mapping. It is very helpful when writing a custom
> libsensors configuration file for the board in question. It takes some
> knowledge of the monitoring chip or its driver though, as the ini file
> references register addresses ("Offset" in the file.)
>
> I gave a quick look and at least the voltage input mapping (and
> presumably the voltage scaling factors as well) is similar to the
> X9SRA so you can reuse this part of the X9SRA configuration file.
>
> Hope that helps,
> --
> Jean Delvare


[no subject]

2013-07-13 Thread Justin Piszcz
subscribe linux-raid


3.10: discard/trim support on md-raid1?

2013-07-13 Thread Justin Piszcz
Hello,

Running 3.10 and I see the following for an md-raid1 of two SSDs:

Checking /sys/block/md1/queue:
add_random: 0
discard_granularity: 512
discard_max_bytes: 2147450880
discard_zeroes_data: 0
hw_sector_size: 512
iostats: 0
logical_block_size: 512
max_hw_sectors_kb: 32767
max_integrity_segments: 0
max_sectors_kb: 512
max_segment_size: 65536
max_segments: 168
minimum_io_size: 512
nomerges: 0
nr_requests: 128
optimal_io_size: 0
physical_block_size: 512
read_ahead_kb: 8192
rotational: 1
rq_affinity: 0
scheduler: none
write_same_max_bytes: 0

What should be seen:
rotational: 0
And possibly:
discard_zeroes_data: 1

Can anyone confirm if there is a workaround to allow TRIM when using
md-raid1?
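
A quick way to check these values without reading the whole directory by eye
is the Python sketch below. The helper names are hypothetical, and the
interpretation (a nonzero discard_max_bytes means the device advertises
discard) is my reading of the attributes rather than anything md-specific.

```python
from pathlib import Path


def read_queue_attrs(queue_dir, names=("rotational", "discard_max_bytes",
                                       "discard_granularity")):
    """Read selected integer attributes from a block device queue directory."""
    attrs = {}
    for name in names:
        path = Path(queue_dir) / name
        if path.is_file():
            attrs[name] = int(path.read_text().strip())
    return attrs


def summarize(attrs):
    """Interpret the attributes the way the listing above is being read."""
    return {
        "reports_rotational": attrs.get("rotational") == 1,
        "advertises_discard": attrs.get("discard_max_bytes", 0) > 0,
    }
```

Calling summarize(read_queue_attrs("/sys/block/md1/queue")) on this machine
should report the array as rotational while still advertising discard, which
matches the listing above.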

Some related discussion here:
http://us.generation-nt.com/answer/md-rotational-attribute-help-206571222.html
http://www.progtown.com/topic343938-ssd-strange-itself-conducts.html


Justin.




3.11 kernel: perf samples too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 50000

2013-09-10 Thread Justin Piszcz
Hello,

I recently upgraded from 3.10 to 3.11 and see these on occasion in the kernel log:

perf samples too long (2501 > 2500), lowering
kernel.perf_event_max_sample_rate to 50000
perf samples too long (5040 > 5000), lowering
kernel.perf_event_max_sample_rate to 25000

I was curious what is causing this, and whether there is a recommended fix?
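
The thresholds in those messages (2500 and 5000) appear to follow from the
sample rate itself. Below is a small Python sketch of the arithmetic as I
understand it, assuming a starting rate of 100000 samples/sec and a CPU time
budget of 25% (kernel.perf_cpu_time_max_percent); both defaults are
assumptions on my part and not confirmed by the log.

```python
NSEC_PER_SEC = 1_000_000_000


def allowed_sample_ns(max_sample_rate, cpu_time_max_percent=25):
    """Per-sample time budget in ns for a given maximum sample rate (Hz)."""
    period_ns = NSEC_PER_SEC // max_sample_rate
    return period_ns * cpu_time_max_percent // 100


# Each halving of the rate doubles the budget a single sample may use.
for rate in (100000, 50000, 25000):
    print(rate, allowed_sample_ns(rate))
```

Under those assumptions a rate of 100000 gives a 2500 ns budget and 50000
gives 5000 ns, which lines up with the two limits in the log messages.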

Thanks,

Justin.


3.6.0 ACPI: e1000e - Invalid Power Resource to register! (Supermicro X9SCM-F) - Disabling IRQ #44

2012-10-08 Thread Justin Piszcz
Hello,

Kernel: 3.6.0 (x86_64)
Distribution: Debian Testing

I was copying 600GB of files from Samba/Linux to a Windows host. It
copied around 500GB, then this happened: the kernel disabled the first
on-board network interface on my Supermicro X9SCM-F board. Any idea why
this happened?

[93593.565667] irq 44: nobody cared (try booting with the "irqpoll" option)
[93593.565673] Pid: 0, comm: swapper/0 Not tainted 3.6.0 #4
[93593.565675] Call Trace:
[0.971861] ACPI: Invalid Power Resource to register!
[93593.565677][] __report_bad_irq+0x31/0xd0
[93593.565690]  [] note_interrupt+0x1a3/0x1f0
[93593.565694]  [] handle_irq_event_percpu+0x89/0x160
[93593.565697]  [] handle_irq_event+0x3c/0x60
[93593.565700]  [] handle_edge_irq+0x6f/0x110
[93593.565705]  [] handle_irq+0x1d/0x30
[93593.565709]  [] do_IRQ+0x55/0xd0
[93593.565714]  [] common_interrupt+0x67/0x67
[93593.565715][] ?
__hrtimer_start_range_ns+0x1bd/0x3b0
[93593.565728]  [] ? acpi_idle_enter_c1+0xaa/0xcf
[93593.565731]  [] ? acpi_idle_enter_c1+0x89/0xcf
[93593.565735]  [] cpuidle_enter+0x19/0x20
[93593.565738]  [] cpuidle_idle_call+0x88/0x100
[93593.565750]  [] cpu_idle+0x5f/0xd0
[93593.565752]  [] rest_init+0x68/0x74
[93593.565755]  [] start_kernel+0x2a8/0x2b5
[93593.565756]  [] ? repair_env_string+0x5e/0x5e
[93593.565758]  [] x86_64_start_reservations+0x101/0x105
[93593.565759]  [] x86_64_start_kernel+0xd8/0xdc
[93593.565760] handlers:
[93593.565762] [] e1000_msix_other
[93593.565763] Disabling IRQ #44

Justin.


RE: 3.6.0 ACPI: e1000e - Invalid Power Resource to register! (Supermicro X9SCM-F) - Disabling IRQ #44

2012-10-08 Thread Justin Piszcz
This appears to be a known issue with this hardware and these NICs:
https://www.centos.org/modules/newbb/viewtopic.php?topic_id=34820

Hopefully someone from Intel can chime in.

Justin.



X9SCM-F/82574L/e1000e lag / high latency (e1000e/Intel bug)

2012-10-09 Thread Justin Piszcz
Hi,

Good news: Supermicro 2.0b fixes an unrelated problem where only 16GB is
addressed in the BIOS when you have 32GB on the system, with 2.0b that is
resolved.

Bad news: This bug still remains (E1000): When you transfer a file/files
over Samba, the latency shoots up really high (this also affects other
applications!)

This bug has been bothering me for months (random lag) during high network
I/O on my X9SCM-F motherboard.
There is a lot of discussion about this problem here:
http://sourceforge.net/p/e1000/bugs/27/?page=4

I tried the EEPROM fix but it did not work:
http://sourceforge.net/projects/e1000/files/e1000e%20stable/eeprom_fix_82574_or_82583/

"The value at offset 0x001e (58) has bit 1 unset. This enables the problematic
power saving feature. In this case, the EEPROM needs to read "5a" at offset
0x001e."

# ethtool -e eth4 |grep 0x0010
0x0010: ff ff ff ff 6b 02 00 00 d9 15 d3 10 ff ff 5a a5

Yes I did reboot.
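
To double-check the byte without counting columns by hand, here is a short
Python sketch that parses a dump row like the one above and tests bit 1 at
offset 0x1e. The function names are hypothetical, and the parser only assumes
the "offset: byte byte ..." row layout shown in this message.

```python
def parse_ethtool_eeprom(dump_text):
    """Map EEPROM offsets to byte values from `ethtool -e` style hex rows."""
    eeprom = {}
    for line in dump_text.splitlines():
        head, sep, tail = line.partition(":")
        if not sep or not head.strip().lower().startswith("0x"):
            continue  # skip headers and anything that is not a dump row
        base = int(head.strip(), 16)
        for index, token in enumerate(tail.split()):
            eeprom[base + index] = int(token, 16)
    return eeprom


def power_saving_fix_applied(eeprom, offset=0x1E, bit=1):
    """True when the given bit is set at the given offset."""
    return bool(eeprom[offset] & (1 << bit))
```

For the row above, offset 0x1e holds 0x5a, which has bit 1 set, so the EEPROM
already reads as "fixed" even though the latency problem remains.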

Here is what the problem looks like, during a SAMBA copy from A->B where
B=X9SCM running Linux:

(ONBOARD Intel eth0 / 82574L )

$ ping windowspc
PING windowspc (192.168.0.1) 56(84) bytes of data.
64 bytes from windowspc (192.168.0.1): icmp_req=1 ttl=128 time=0.544 ms
64 bytes from windowspc (192.168.0.1): icmp_req=2 ttl=128 time=0.193 ms
64 bytes from windowspc (192.168.0.1): icmp_req=3 ttl=128 time=0.619 ms
64 bytes from windowspc (192.168.0.1): icmp_req=4 ttl=128 time=0.642 ms
64 bytes from windowspc (192.168.0.1): icmp_req=5 ttl=128 time=0.426 ms
64 bytes from windowspc (192.168.0.1): icmp_req=6 ttl=128 time=0.464 ms
64 bytes from windowspc (192.168.0.1): icmp_req=7 ttl=128 time=0.696 ms
64 bytes from windowspc (192.168.0.1): icmp_req=8 ttl=128 time=1353 ms
64 bytes from windowspc (192.168.0.1): icmp_req=9 ttl=128 time=353 ms
64 bytes from windowspc (192.168.0.1): icmp_req=10 ttl=128 time=0.492 ms
64 bytes from windowspc (192.168.0.1): icmp_req=11 ttl=128 time=0.618 ms
64 bytes from windowspc (192.168.0.1): icmp_req=12 ttl=128 time=0.474 ms
64 bytes from windowspc (192.168.0.1): icmp_req=13 ttl=128 time=0.542 ms
64 bytes from windowspc (192.168.0.1): icmp_req=14 ttl=128 time=0.471 ms
64 bytes from windowspc (192.168.0.1): icmp_req=15 ttl=128 time=0.645 ms
64 bytes from windowspc (192.168.0.1): icmp_req=16 ttl=128 time=0.394 ms
64 bytes from windowspc (192.168.0.1): icmp_req=17 ttl=128 time=0.537 ms
64 bytes from windowspc (192.168.0.1): icmp_req=18 ttl=128 time=0.706 ms
64 bytes from windowspc (192.168.0.1): icmp_req=19 ttl=128 time=0.465 ms
64 bytes from windowspc (192.168.0.1): icmp_req=20 ttl=128 time=0.707 ms
64 bytes from windowspc (192.168.0.1): icmp_req=21 ttl=128 time=348 ms
64 bytes from windowspc (192.168.0.1): icmp_req=22 ttl=128 time=0.703 ms
64 bytes from windowspc (192.168.0.1): icmp_req=23 ttl=128 time=0.560 ms
64 bytes from windowspc (192.168.0.1): icmp_req=24 ttl=128 time=0.554 ms
64 bytes from windowspc (192.168.0.1): icmp_req=25 ttl=128 time=0.585 ms
64 bytes from windowspc (192.168.0.1): icmp_req=26 ttl=128 time=0.508 ms
64 bytes from windowspc (192.168.0.1): icmp_req=27 ttl=128 time=345 ms
64 bytes from windowspc (192.168.0.1): icmp_req=28 ttl=128 time=0.374 ms
64 bytes from windowspc (192.168.0.1): icmp_req=29 ttl=128 time=0.728 ms
64 bytes from windowspc (192.168.0.1): icmp_req=30 ttl=128 time=0.537 ms
64 bytes from windowspc (192.168.0.1): icmp_req=31 ttl=128 time=0.190 ms
64 bytes from windowspc (192.168.0.1): icmp_req=32 ttl=128 time=0.204 ms
64 bytes from windowspc (192.168.0.1): icmp_req=33 ttl=128 time=0.239 ms
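
To pick the spikes out of a longer capture, a minimal Python sketch along
these lines may help. The 100 ms threshold and the function name are
arbitrary choices of mine, and the pattern is fitted to the ping output
format shown above.

```python
import re

# Matches the reply lines shown above; icmp_seq is accepted as well because
# some ping versions use that spelling.
PING_RE = re.compile(r"icmp_(?:req|seq)=(\d+) ttl=\d+ time=([\d.]+) ms")


def latency_spikes(ping_output, threshold_ms=100.0):
    """Return (sequence number, latency in ms) for replies above the threshold."""
    spikes = []
    for line in ping_output.splitlines():
        match = PING_RE.search(line)
        if match and float(match.group(2)) > threshold_ms:
            spikes.append((int(match.group(1)), float(match.group(2))))
    return spikes
```

On the capture above this would flag replies 8, 9, 21 and 27, the same ones
that stand out when reading the output by eye.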

Same test (copy test) with samba as above but now with an Intel 4-port NIC:
$ ping windowspc
64 bytes from windowspc (192.168.0.1): icmp_req=1 ttl=128 time=0.175 ms
64 bytes from windowspc (192.168.0.1): icmp_req=2 ttl=128 time=0.332 ms
64 bytes from windowspc (192.168.0.1): icmp_req=3 ttl=128 time=0.276 ms
64 bytes from windowspc (192.168.0.1): icmp_req=4 ttl=128 time=0.221 ms
64 bytes from windowspc (192.168.0.1): icmp_req=5 ttl=128 time=0.518 ms
64 bytes from windowspc (192.168.0.1): icmp_req=6 ttl=128 time=0.157 ms
64 bytes from windowspc (192.168.0.1): icmp_req=7 ttl=128 time=0.222 ms
64 bytes from windowspc (192.168.0.1): icmp_req=8 ttl=128 time=0.605 ms
64 bytes from windowspc (192.168.0.1): icmp_req=9 ttl=128 time=0.335 ms
64 bytes from windowspc (192.168.0.1): icmp_req=10 ttl=128 time=0.679 ms
64 bytes from windowspc (192.168.0.1): icmp_req=11 ttl=128 time=0.223 ms
64 bytes from windowspc (192.168.0.1): icmp_req=12 ttl=128 time=0.189 ms
64 bytes from windowspc (192.168.0.1): icmp_req=13 ttl=128 time=0.432 ms
64 bytes from windowspc (192.168.0.1): icmp_req=14 ttl=128 time=0.235 ms
64 bytes from windowspc (192.168.0.1): icmp_req=15 ttl=128 time=0.386 ms
64 bytes from windowspc (192.168.0.1): icmp_req=16 ttl=128 time=0.658 ms
64 bytes from windowspc (192.168.0.1): icmp_req=17 ttl=128 time=0.430 ms
64 bytes from windowspc (192.168.0.1): icmp_req=18 ttl=128 time=0.494 ms
64 bytes from windowspc (192.168.0.1): icmp_req=19 ttl=128 time=0.

RE: X9SCM-F/82574L/e1000e lag / high latency (e1000e/Intel bug)

2012-10-10 Thread Justin Piszcz


-Original Message-
From: Justin Piszcz [mailto:jpis...@lucidpixels.com] 
Sent: Tuesday, October 09, 2012 6:15 PM
To: linux-kernel@vger.kernel.org; e1000-de...@lists.sf.net
Cc: supp...@supermicro.com; a...@solarrain.com
Subject: X9SCM-F/82574L/e1000e lag / high latency (e1000e/Intel bug)

If you have > 16GB do not upgrade to 2.0b, I am writing up a page on this
and will post it shortly, the board will show 32GB with 2.0b but when it
tries to address it, the machine reboots.  Stick with 2.0a for now, I will
call Supermicro later, will post a page on all of these issues in a bit.



RE: X9SCM-F/82574L/e1000e lag / high latency (e1000e/Intel bug)

2012-10-10 Thread Justin Piszcz


-Original Message-
From: Justin Piszcz [mailto:jpis...@lucidpixels.com] 
Sent: Tuesday, October 09, 2012 6:15 PM
To: linux-kernel@vger.kernel.org; e1000-de...@lists.sf.net
Cc: supp...@supermicro.com; a...@solarrain.com
Subject: X9SCM-F/82574L/e1000e lag / high latency (e1000e/Intel bug)

If you have > 16GB do not upgrade to 2.0b, I am writing up a page on this
and will post it shortly, the board will show 32GB with 2.0b but when it
tries to address it, the machine reboots.  Stick with 2.0a for now, I will
call Supermicro later, will post a page on all of these issues in a bit.

--

As promised:
https://sites.google.com/a/lucidpixels.com/web/blog/supermicrox9scm-fissues

Summary:

1. the board does not always address 32GB of ram if you are using an Ivy
Bridge chip on this motherboard
2. e1000e network problems
a) During heavy network I/O (file copy) on eth0 the network latency
jumps to 300-1000ms+ every 4-5 seconds (it does not do this on a separate
card)
b) During heavy network I/O (file copy) of over 600GB of files, the
kernel disabled the network IRQ on eth0 and took the server offline from a
network perspective (it has not done this on a separate card..yet)
3. Problems with PCI-e cards
summary: Don't expect to use all four PCI-e slots if they use a lot of
power
4. clock drift issues:
  summary: expect some strangeness if you use gpsd/a gps to help sync your
time, due to what SM noted below
http://lists.ntp.org/pipermail/pool/2012-July/006019.html

here is a picture of memtest86 showing "Unexpected Interrupt - Halting" when
2.0b BIOS is used with 32gb of ram:
http://home.comcast.net/~jpiszcz/20121010/x9scm-web-small.jpg

here is the issue with memtest86 (crashes/reboots host when you have 2.0b +
32gb of ram):
https://www.youtube.com/watch?feature=player_embedded&v=M2TWO5kFm9U

Justin.



3.10 kernel: [drm:evergreen_startup] *ERROR* radeon: error initializing UVD (-1).

2013-07-02 Thread Justin Piszcz
Hello,

I use an ATI graphics card (PCI-e x1) for a server:
Card: [AMD/ATI] Park [Mobility Radeon HD 5430]

I upgraded my kernel from 3.9.x to 3.10, after rebooting I found the driver
now wants a new firmware:
radeon :05:00.0: radeon_uvd: Can't load firmware
"radeon/CYPRESS_uvd.bin"

I pulled the latest firmware down (and also updated the others) and
rebooted:
CONFIG_EXTRA_FIRMWARE="radeon/CEDAR_me.bin radeon/CEDAR_pfp.bin
radeon/CEDAR_rlc.bin radeon/CYPRESS_uvd.bin"

Then have this issue:
[drm] radeon: irq initialized.
[drm:evergreen_startup] *ERROR* radeon: error initializing UVD (-1).

The screen goes black for ~5 seconds and then comes back on.
Not sure if this has been reported or not (w/CEDAR chipset) but FYI.

Kernel config:
http://home.comcast.net/~jpiszcz/20130702/config-3.10-4.txt

Full dmesg:
http://home.comcast.net/~jpiszcz/20130702/full-dmesg.txt

Possibly related:
http://lists.opensuse.org/opensuse-bugs/2013-06/msg00053.html
https://bugzilla.novell.com/show_bug.cgi?id=822777
https://bugzilla.novell.com/show_bug.cgi?id=822777#c0
http://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg38066.html
https://bugs.freedesktop.org/show_bug.cgi?id=63935

Snippet:
[2.033535] [drm:r600_uvd_init] *ERROR* UVD not responding, trying to
reset the VCPU!!!
[3.053742] [drm:r600_uvd_init] *ERROR* UVD not responding, trying to
reset the VCPU!!!
[4.073985] [drm:r600_uvd_init] *ERROR* UVD not responding, trying to
reset the VCPU!!!
[5.094192] [drm:r600_uvd_init] *ERROR* UVD not responding, trying to
reset the VCPU!!!
[6.114405] [drm:r600_uvd_init] *ERROR* UVD not responding, trying to
reset the VCPU!!!
[7.134611] [drm:r600_uvd_init] *ERROR* UVD not responding, trying to
reset the VCPU!!!
[8.154824] [drm:r600_uvd_init] *ERROR* UVD not responding, trying to
reset the VCPU!!!
[9.175044] [drm:r600_uvd_init] *ERROR* UVD not responding, trying to
reset the VCPU!!!
[   10.195251] [drm:r600_uvd_init] *ERROR* UVD not responding, trying to
reset the VCPU!!!
[   11.215464] [drm:r600_uvd_init] *ERROR* UVD not responding, trying to
reset the VCPU!!!
[   11.235537] [drm:r600_uvd_init] *ERROR* UVD not responding, giving up!!!
[   11.235607] [drm:evergreen_startup] *ERROR* radeon: error initializing
UVD (-1).

Justin.



RE: 3.10 kernel: [drm:evergreen_startup] *ERROR* radeon: error initializing UVD (-1).

2013-07-02 Thread Justin Piszcz


-Original Message-
From: Alex Deucher [mailto:alexdeuc...@gmail.com] 
Sent: Tuesday, July 02, 2013 2:36 PM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; dri-de...@lists.freedesktop.org
Subject: Re: 3.10 kernel: [drm:evergreen_startup] *ERROR* radeon: error
initializing UVD (-1).

> The screen goes black for ~5 seconds and then comes back on.
> Not sure if this has been reported or not (w/CEDAR chipset) but FYI.

Make sure you have the latest CEDAR_rlc.bin in addition to
CYPRESS_uvd.bin and make sure the latest images are available in your
initrd or kernel image, etc.

Alex

--

Hi,

Even with the latest (confirmed below) and built-in to the kernel:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/dwmw2/linux-firmware.git
Cloning into 'linux-firmware'...
remote: Counting objects: 2210, done.
remote: Compressing objects: 100% (1102/1102), done.
remote: Total 2210 (delta 1123), reused 2113 (delta 1073)
Receiving objects: 100% (2210/2210), 39.84 MiB | 6.49 MiB/s, done.
Resolving deltas: 100% (1123/1123), done.
$ date
Tue Jul  2 14:39:19 EDT 2013

2b244d41832f46382bfbb8994522dcdd  d/linux-firmware/radeon/CEDAR_me.bin
23915e382ea0d2f2491a19146ca3001c  d/linux-firmware/radeon/CEDAR_pfp.bin
e8770d3d588f24dc6f1a8609c9db3467  d/linux-firmware/radeon/CEDAR_rlc.bin
fb23b281dcc94a035d374e709c9842bd  d/linux-firmware/radeon/CYPRESS_uvd.bin

Current system firmware (/lib/firmware/radeon):
$ for i in $CONFIG_EXTRA_FIRMWARE; do md5sum $i; done
2b244d41832f46382bfbb8994522dcdd  radeon/CEDAR_me.bin
23915e382ea0d2f2491a19146ca3001c  radeon/CEDAR_pfp.bin
e8770d3d588f24dc6f1a8609c9db3467  radeon/CEDAR_rlc.bin
fb23b281dcc94a035d374e709c9842bd  radeon/CYPRESS_uvd.bin

CONFIG_EXTRA_FIRMWARE="radeon/CEDAR_me.bin radeon/CEDAR_pfp.bin
radeon/CEDAR_rlc.bin radeon/CYPRESS_uvd.bin"

Same issue.
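
The same comparison can be scripted so it is easy to repeat after each
firmware update. This is only a sketch in Python: the helper names are
hypothetical, and the two directories are whatever locations hold the fresh
linux-firmware checkout and the installed firmware on a given system.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_firmware(names, upstream_dir, installed_dir):
    """Return the names whose upstream and installed copies differ or are missing."""
    mismatches = []
    for name in names:
        upstream = Path(upstream_dir) / name
        installed = Path(installed_dir) / name
        if not (upstream.is_file() and installed.is_file()):
            mismatches.append(name)
        elif md5_of(upstream) != md5_of(installed):
            mismatches.append(name)
    return mismatches
```

Passing the CONFIG_EXTRA_FIRMWARE names split on whitespace, an empty result
means every installed image matches the checkout, as the checksums above show.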

Justin.



3.10: NCT6776F sensor question with Supermicro X9SRL-F motherboard

2013-07-03 Thread Justin Piszcz
Hello,

Currently running 3.10 with:
CONFIG_SENSORS_NCT6775=y

Motherboard: Supermicro X9SRL-F

A couple questions:
1) Was curious if the PCH CHIP/CPU/MCH temperatures should be populated for
this board? 
2) Additionally, why is the CPUTIN in alarm?

I also found:
http://www.lm-sensors.org/wiki/Configurations/SuperMicro/X9SRA

Does Super Micro also have such a config file for the X9SRL-F?
This board uses a NCT6776F.

Relevant output from lm_sensors 3.6.0+dfsg1-1:

nct6776-isa-0a30
Adapter: ISA adapter
Vcore:  +0.81 V  (min =  +0.54 V, max =  +1.49 V)
in1:+1.85 V  (min =  +1.62 V, max =  +1.99 V)
AVCC:   +3.30 V  (min =  +2.98 V, max =  +3.63 V)
+3.3V:  +3.30 V  (min =  +2.98 V, max =  +3.63 V)
in4:+1.51 V  (min =  +1.35 V, max =  +1.65 V)
in5:+1.27 V  (min =  +1.13 V, max =  +1.38 V)
in6:+1.06 V  (min =  +0.92 V, max =  +1.34 V)
3VSB:   +3.57 V  (min =  +2.98 V, max =  +3.63 V)
Vbat:   +3.49 V  (min =  +2.70 V, max =  +3.63 V)
fan1:   986 RPM  (min =  700 RPM)
fan2:  1322 RPM  (min =  700 RPM)
fan3:  1103 RPM  (min =  700 RPM)
fan4:  1080 RPM  (min =  700 RPM)
fan5:  1001 RPM  (min =  700 RPM)
SYSTIN: +42.0 C  (high = +75.0 C, hyst = +70.0 C)  sensor = thermistor
CPUTIN: +33.0 C  (high = +95.0 C, hyst = +92.0 C)  ALARM  sensor = thermistor
AUXTIN: +23.0 C  (high = +80.0 C, hyst = +75.0 C)  sensor = thermistor
PECI Agent 0:+0.0 C  (high = +95.0 C, hyst = +92.0 C)
 (crit = +100.0 C)
PCH_CHIP_TEMP:   +0.0 C  
PCH_CPU_TEMP:+0.0 C  
PCH_MCH_TEMP:+0.0 C  
intrusion0:ALARM
intrusion1:ALARM

sensors3.conf snippet:

chip "w83627ehf-*" "w83627dhg-*" "w83667hg-*" "nct6775-*" "nct6776-*"

label in0 "Vcore"
label in2 "AVCC"
label in3 "+3.3V"
label in7 "3VSB"
label in8 "Vbat"

set in2_min  3.3 * 0.90
set in2_max  3.3 * 1.10
set in3_min  3.3 * 0.90
set in3_max  3.3 * 1.10
set in7_min  3.3 * 0.90
set in7_max  3.3 * 1.10
set in8_min  3.0 * 0.90
set in8_max  3.3 * 1.10
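
For reference, the limits those "set" lines work out to can be checked with a
few lines of Python. This is just the arithmetic from the snippet; the
function name is my own, and rounding to two decimals is an assumption made
to mirror how sensors prints the values.

```python
def limits(nominal_min_volts, nominal_max_volts=None,
           low_factor=0.90, high_factor=1.10):
    """Return (min, max) limits as the `set` statements above compute them."""
    if nominal_max_volts is None:
        nominal_max_volts = nominal_min_volts
    return (round(nominal_min_volts * low_factor, 2),
            round(nominal_max_volts * high_factor, 2))


# in2, in3 and in7 use 3.3 V for both limits; in8 uses 3.0 V for the minimum.
print("in2/in3/in7:", limits(3.3))
print("in8:", limits(3.0, 3.3))
```

The 3.3 V rails come out at 2.97 V and 3.63 V, and in8 at 2.70 V and 3.63 V.
The sensors output shows +2.98 V for the lower 3.3 V limit, presumably because
the chip stores limits in coarser register steps.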

Justin.



3.10: Intel HWMON/NIC temperature sensor question

2013-07-03 Thread Justin Piszcz
Hi,

I saw this in the device drivers section and was curious which Intel-based
NICs contain temperature sensors?

Intel(R) 10GbE PCI Express adapters HWMON support
Intel(R) PCI-Express Gigabit adapters HWMON support

I checked the boards below and none appear to expose a hwmon interface:
08:00.0 Ethernet controller: Intel Corporation 82580 Gigabit Network
Connection (rev 01)
08:00.1 Ethernet controller: Intel Corporation 82580 Gigabit Network
Connection (rev 01)
08:00.2 Ethernet controller: Intel Corporation 82580 Gigabit Network
Connection (rev 01)
08:00.3 Ethernet controller: Intel Corporation 82580 Gigabit Network
Connection (rev 01)
09:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network
Connection
0a:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network
Connection
01:00.0 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AT2 Server
Adapter (rev 01)

# find /sys/|grep hwmon

Radeon:
/sys/devices/pci:00/:00:03.1/:05:00.0/hwmon
/sys/devices/pci:00/:00:03.1/:05:00.0/hwmon/hwmon0
/sys/devices/pci:00/:00:03.1/:05:00.0/hwmon/hwmon0/name
/sys/devices/pci:00/:00:03.1/:05:00.0/hwmon/hwmon0/power
/sys/devices/pci:00/:00:03.1/:05:00.0/hwmon/hwmon0/power/control
/sys/devices/pci:00/:00:03.1/:05:00.0/hwmon/hwmon0/power/runtime_active_time
/sys/devices/pci:00/:00:03.1/:05:00.0/hwmon/hwmon0/power/autosuspend_delay_ms
/sys/devices/pci:00/:00:03.1/:05:00.0/hwmon/hwmon0/power/runtime_status
/sys/devices/pci:00/:00:03.1/:05:00.0/hwmon/hwmon0/power/runtime_suspended_time
/sys/devices/pci:00/:00:03.1/:05:00.0/hwmon/hwmon0/device
/sys/devices/pci:00/:00:03.1/:05:00.0/hwmon/hwmon0/subsystem
/sys/devices/pci:00/:00:03.1/:05:00.0/hwmon/hwmon0/uevent
/sys/devices/pci:00/:00:03.1/:05:00.0/hwmon/hwmon0/temp1_input

Onboard chipset:
/sys/devices/platform/nct6775.2608/hwmon
/sys/devices/platform/nct6775.2608/hwmon/hwmon2
/sys/devices/platform/nct6775.2608/hwmon/hwmon2/power
/sys/devices/platform/nct6775.2608/hwmon/hwmon2/power/control
/sys/devices/platform/nct6775.2608/hwmon/hwmon2/power/runtime_active_time
/sys/devices/platform/nct6775.2608/hwmon/hwmon2/power/autosuspend_delay_ms
/sys/devices/platform/nct6775.2608/hwmon/hwmon2/power/runtime_status
/sys/devices/platform/nct6775.2608/hwmon/hwmon2/power/runtime_suspended_time
/sys/devices/platform/nct6775.2608/hwmon/hwmon2/device
/sys/devices/platform/nct6775.2608/hwmon/hwmon2/subsystem
/sys/devices/platform/nct6775.2608/hwmon/hwmon2/uevent

CPU:
/sys/devices/platform/coretemp.0/hwmon
/sys/devices/platform/coretemp.0/hwmon/hwmon1
/sys/devices/platform/coretemp.0/hwmon/hwmon1/power
/sys/devices/platform/coretemp.0/hwmon/hwmon1/power/control
/sys/devices/platform/coretemp.0/hwmon/hwmon1/power/runtime_active_time
/sys/devices/platform/coretemp.0/hwmon/hwmon1/power/autosuspend_delay_ms
/sys/devices/platform/coretemp.0/hwmon/hwmon1/power/runtime_status
/sys/devices/platform/coretemp.0/hwmon/hwmon1/power/runtime_suspended_time
/sys/devices/platform/coretemp.0/hwmon/hwmon1/device
/sys/devices/platform/coretemp.0/hwmon/hwmon1/subsystem
/sys/devices/platform/coretemp.0/hwmon/hwmon1/uevent
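
A shorter way to see which drivers registered a hwmon device is to read the
name attribute of each entry under /sys/class/hwmon. The sketch below is a
rough Python equivalent of the find above; the helper name is made up, and it
assumes the name attribute sits either on the hwmon class device (as in the
Radeon listing) or on its parent device.

```python
import os


def hwmon_names(root="/sys/class/hwmon"):
    """Return {"hwmonN": chip name or None} for each entry under root."""
    names = {}
    if not os.path.isdir(root):
        return names
    for entry in sorted(os.listdir(root)):
        names[entry] = None
        # Assumption: "name" lives on the class device or on its parent.
        for rel in ("name", os.path.join("device", "name")):
            try:
                with open(os.path.join(root, entry, rel)) as fh:
                    names[entry] = fh.read().strip()
                break
            except OSError:
                continue
    return names


for dev, name in hwmon_names().items():
    print(dev, name)
```

A NIC driver that registered a hwmon device would be expected to show up in
this list next to radeon, coretemp and nct6775.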

Justin.





RE: 3.10 kernel: [drm:evergreen_startup] *ERROR* radeon: error initializing UVD (-1). [SOLVED]

2013-07-03 Thread Justin Piszcz


-Original Message-
From: Alex Deucher [mailto:alexdeuc...@gmail.com] 
Sent: Tuesday, July 02, 2013 3:41 PM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; dri-de...@lists.freedesktop.org
Subject: Re: 3.10 kernel: [drm:evergreen_startup] *ERROR* radeon: error
initializing UVD (-1).

Please open a bug (product: DRI, component: DRM/Radeon):
https://bugs.freedesktop.org
and attach your dmesg output and xorg log.

Alex

Per:
https://bugs.freedesktop.org/show_bug.cgi?id=66519

Grabbed firmware, removed distribution firmware package, re-built kernel and
all is working now, thanks!

Justin.





RE: [lm-sensors] 3.10: NCT6776F sensor question with Supermicro X9SRL-F motherboard

2013-07-03 Thread Justin Piszcz


-Original Message-
From: Guenter Roeck [mailto:li...@roeck-us.net] 
Sent: Wednesday, July 03, 2013 10:42 AM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; lm-sens...@lm-sensors.org
Subject: Re: [lm-sensors] 3.10: NCT6776F sensor question with Supermicro
X9SRL-F motherboard

[ .. ]

This is surprising and might be where the alarm comes from. What output do you
get if you load the coretemp driver?

coretemp-isa-
Adapter: ISA adapter
Physical id 0:  +38.0 C  (high = +81.0 C, crit = +91.0 C)
Core 0: +36.0 C  (high = +81.0 C, crit = +91.0 C)
Core 1: +35.0 C  (high = +81.0 C, crit = +91.0 C)
Core 2: +35.0 C  (high = +81.0 C, crit = +91.0 C)
Core 3: +34.0 C  (high = +81.0 C, crit = +91.0 C)
Core 4: +37.0 C  (high = +81.0 C, crit = +91.0 C)
Core 5: +38.0 C  (high = +81.0 C, crit = +91.0 C)

> PCH_CHIP_TEMP:   +0.0 C  
> PCH_CPU_TEMP:+0.0 C  
> PCH_MCH_TEMP:+0.0 C  
> intrusion0:ALARM
> intrusion1:ALARM

Are those not connected ?

The intrusion headers are not connected. Also, I have not dug into it, but
when you try to ignore the PECI or PCH* sensors, lm_sensors seems to ignore
the rule.

sensors3.conf:
ignore PCH_CHIP_TEMP
ignore PCH_CPU_TEMP
ignore PCH_MCH_TEMP

$ sensors  |tail -n 4
PCH_CHIP_TEMP:   +0.0 C
PCH_CPU_TEMP:+0.0 C
PCH_MCH_TEMP:+0.0 C

Ignoring intrusion works though:
ignore intrusion0
ignore intrusion1


nct6776-isa-0a30
Adapter: ISA adapter
Vcore:  +0.84 V  (min =  +0.54 V, max =  +1.49 V)
in1:+1.84 V  (min =  +1.62 V, max =  +1.99 V)
AVCC:   +3.28 V  (min =  +2.98 V, max =  +3.63 V)
+3.3V:  +3.28 V  (min =  +2.98 V, max =  +3.63 V)
in4:+1.50 V  (min =  +1.35 V, max =  +1.65 V)
in5:+1.26 V  (min =  +1.13 V, max =  +1.38 V)
in6:+1.06 V  (min =  +0.92 V, max =  +1.34 V)
3VSB:   +3.57 V  (min =  +2.98 V, max =  +3.63 V)
Vbat:   +3.47 V  (min =  +2.70 V, max =  +3.63 V)
fan1:  1007 RPM  (min =  700 RPM)
fan2:  1317 RPM  (min =  700 RPM)
fan3:  1102 RPM  (min =  700 RPM)
fan4:  1059 RPM  (min =  700 RPM)
fan5:   998 RPM  (min =  700 RPM)
SYSTIN: +40.0 C  (high = +75.0 C, hyst = +70.0 C)  sensor = thermistor
CPUTIN: +31.5 C  (high = +95.0 C, hyst = +92.0 C)  ALARM  sensor = thermistor
AUXTIN: +23.0 C  (high = +80.0 C, hyst = +75.0 C)  sensor = thermistor
PECI Agent 0:+0.0 C  (high = +95.0 C, hyst = +92.0 C)
 (crit = +100.0 C)
PCH_CHIP_TEMP:   +0.0 C
PCH_CPU_TEMP:+0.0 C
PCH_MCH_TEMP:+0.0 C

Thanks,
Guenter



RE: [lm-sensors] 3.10: NCT6776F sensor question with Supermicro X9SRL-F motherboard

2013-07-03 Thread Justin Piszcz


-Original Message-
From: Guenter Roeck [mailto:li...@roeck-us.net] 
Sent: Wednesday, July 03, 2013 12:33 PM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; lm-sens...@lm-sensors.org
Subject: Re: [lm-sensors] 3.10: NCT6776F sensor question with Supermicro
X9SRL-F motherboard

[ .. ]
> 
Can you install superiotool and run "sudo superiotool -V -e" ?
I would like to see raw data from the superio chip.

# superiotool -V -e
superiotool r6637
..
Probing for Nuvoton Super I/O (sid=0xfc) at 0x164e...
Found Nuvoton WPCM450 (id=0x1a11, rev=0x00) at 0x164e
Probing for Nuvoton Super I/O at 0x2e...
Found Nuvoton NCT6776F (C) (id=0xc333) at 0x2e

[ .. ]

You have to specify the raw attribute names, not the symbolic ones.
You see the raw attribute names with "sensors -u".

# sensors -u
PCH_CHIP_TEMP:
  temp8_input: 0.000
PCH_CPU_TEMP:
  temp9_input: 0.000
PCH_MCH_TEMP:
  temp10_input: 0.000

Tried:
ignore PCH_CHIP_TEMP
ignore temp8_input

For some reason still can't get rid of it appearing:
PCH_CHIP_TEMP:   +0.0 C  
(..as well as the others)
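
One possibility, not confirmed in this thread: as far as I understand
sensors.conf(5), an ignore statement takes the feature name (temp8), not the
label (PCH_CHIP_TEMP) and not the subfeature name (temp8_input), and it has to
sit inside the chip section for the nct6776. Under that assumption, the Python
sketch below derives the ignore lines from the "sensors -u" output quoted
above; the function name is illustrative.

```python
import re

# "sensors -u" output as quoted above.
SENSORS_U = """\
PCH_CHIP_TEMP:
  temp8_input: 0.000
PCH_CPU_TEMP:
  temp9_input: 0.000
PCH_MCH_TEMP:
  temp10_input: 0.000
"""


def ignore_lines(sensors_u, labels):
    """Build 'ignore <feature>' statements for the given sensor labels."""
    statements = []
    label = None
    for line in sensors_u.splitlines():
        if line and not line.startswith(" "):
            # Unindented lines carry the label, e.g. "PCH_CHIP_TEMP:".
            label = line.rstrip(":")
            continue
        # Indented lines carry subfeatures, e.g. "  temp8_input: 0.000".
        m = re.match(r"\s+([a-z]+\d+)_\w+:", line)
        if m and label in labels:
            statement = f"ignore {m.group(1)}"
            if statement not in statements:
                statements.append(statement)
    return statements


for statement in ignore_lines(
    SENSORS_U, {"PCH_CHIP_TEMP", "PCH_CPU_TEMP", "PCH_MCH_TEMP"}
):
    print(statement)
```

If the assumption holds, the resulting lines would go under the existing chip
"nct6776-*" statement in sensors3.conf.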


> $ sensors  |tail -n 4
> PCH_CHIP_TEMP:   +0.0 C
> PCH_CPU_TEMP:+0.0 C
> PCH_MCH_TEMP:+0.0 C
> 
> Ignoring intrusion works though:
> ignore intrusion0
> ignore intrusion1
> 
Does the board have intrusion detection headers ? If so, you could close
(bridge) the header(s) which should get rid of the alarm.

ftp://ftp.supermicro.com/CDR-X9-UP_1.21_for_Intel_X9_UP_platform/MANUALS/X9SRL-F/X9SRL-F.pdf
@ page 49 (2-23) [motherboard: JL1 2-pin header]

-





RE: [lm-sensors] 3.10: NCT6776F sensor question with Supermicro X9SRL-F motherboard

2013-07-03 Thread Justin Piszcz


-Original Message-
From: Guenter Roeck [mailto:li...@roeck-us.net] 
Sent: Wednesday, July 03, 2013 1:17 PM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; lm-sens...@lm-sensors.org
Subject: Re: [lm-sensors] 3.10: NCT6776F sensor question with Supermicro
X9SRL-F motherboard

[ .. ]

Can you try http://roeck-us.net/linux/bin/superiotool ?

Sure, below:

$ sudo ./superiotool 
superiotool r4.0-2514-gf419483
Found Nuvoton WPCM450 (id=0x1a11, rev=0x00) at 0x164e
Found Nuvoton NCT6776F (C) (id=0xc333) at 0x2e
$ sudo ./superiotool -V -e
Probing for Nuvoton Super I/O (sid=0xfc) at 0x164e...
Found Nuvoton WPCM450 (id=0x1a11, rev=0x00) at 0x164e
Probing for Nuvoton Super I/O at 0x2e...
Found Nuvoton NCT6776F (C) (id=0xc333) at 0x2e
Bank 0:
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 
00: 03 33 03 33 00 ff ff ff ff ff ff ff ff ff ff ff 
01: 04 ff 00 00 00 00 ff ff 00 00 00 00 00 83 00 00 
02: 69 e6 ce ce bc 9e 84 2b ff ff ff ba 43 f9 cb e3 
03: ba e3 ba ce a9 ac 8d a8 73 4b 46 ff ff ff ff ff 
04: 03 fe 58 ff ff 80 3f ff 2d ff ff ff 10 05 00 a3 
05: ff ff ff ff ff ff ff ff c1 ff ff ff ff 01 00 ff 
06: 00 ff ff ff ff 01 07 ff ff ff ff ff ff ff ff ff 
07: 00 0a 00 21 00 00 00 17 00 ff ff ff ff ff ff ff 
08: ff 03 1f 0f ff 3c 3c 3c 00 00 00 00 00 00 00 00 
09: 0a 00 00 00 00 0a 0a 0a 0a aa ef 80 ff 40 46 c4 
0a: 0e 01 00 00 ff 00 00 ff 00 00 80 66 66 06 01 01 
0b: 00 00 00 00 02 01 35 00 1c 00 00 04 38 c0 c4 ff 
0c: 01 00 00 00 00 00 00 00 00 07 02 ff ff ff ff ff 
0d: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0e: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0f: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
Bank 1:
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 
00: 82 00 00 01 01 28 01 3c ff 33 ff ff 00 ff ff ff 
01: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
02: 01 23 28 37 37 ff ff 28 51 ff ff ff ff ff ff ff 
03: ff 00 ff ff 00 55 ff ff 05 01 00 00 00 00 00 00 
04: ff ff ff ff ff d9 03 ff ff ff ff ff ff ff 01 ff 
05: 00 00 00 5c 00 5f 00 ff ff ff ff ff ff ff ff ff 
06: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
07: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
08: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
09: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0a: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0b: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0c: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0d: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0e: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0f: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
Bank 2:
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 
00: 8c 32 00 01 01 33 01 3c ff 33 ff ff 00 ff ff ff 
01: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
02: 01 32 5a 5a 5a ff ff 33 ff ff ff ff ff ff ff ff 
03: ff 00 ff ff 00 5f ff ff 03 01 00 00 00 00 00 00 
04: ff ff ff ff ff 2c 05 ff ff ff ff ff ff ff 02 ff 
05: 21 00 00 5c 00 5f 00 ff ff ff ff ff ff ff ff ff 
06: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
07: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
08: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
09: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0a: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0b: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0c: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0d: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0e: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0f: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
Bank 3:
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 
00: 03 00 00 0a 0a 01 01 3c ff ff ff ff 00 ff ff ff 
01: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
02: 00 19 23 2d 37 ff ff 8c aa c8 e6 ff ff ff ff ff 
03: ff 00 ff ff 00 3c ff ff 00 01 00 00 00 00 00 00 
04: ff ff ff ff ff 48 04 ff ff ff ff ff ff ff 03 ff 
05: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
06: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
07: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
08: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
09: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0a: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0b: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0c: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0d: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0e: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0f: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
Bank 4:
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 
00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
01: 00 00 00 00 00 00 00 00 00 00 00 ff ff ff ff ff 
02: ff ff ff 96 64 96 64 e1 96 ff ff ff ff ff ff ff 
03: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
04: 3f 00 03 ff ff ff ff ff ff ff ff ff ff ff 04 ff 
05: 31 13 ff ff 00 00 00 ff 00 20 12 00 09 ff ff ff 
06: ff 01 00 00 00 00 00 00 ff ff ff ff ff ff ff ff 
07: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
08: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
09: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0a: ff ff ff ff ff ff ff

RE: [lm-sensors] 3.10: NCT6776F sensor question with Supermicro X9SRL-F motherboard

2013-07-03 Thread Justin Piszcz
Re-sending as text.

From: Justin Piszcz [mailto:jpis...@lucidpixels.com] 
Sent: Wednesday, July 03, 2013 5:00 PM
To: 'Guenter Roeck'
Cc: linux-kernel@vger.kernel.org; lm-sens...@lm-sensors.org
Subject: RE: [lm-sensors] 3.10: NCT6776F sensor question with Supermicro
X9SRL-F motherboard



-Original Message-
From: Guenter Roeck [mailto:li...@roeck-us.net] 
Sent: Wednesday, July 03, 2013 4:57 PM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; lm-sens...@lm-sensors.org
Subject: Re: [lm-sensors] 3.10: NCT6776F sensor question with Supermicro
X9SRL-F motherboard

[ .. ]

> 09: 0a 00 00 00 00 0a 0a 0a 0a aa ef 80 ff 40 46 c4 
> 0a: 0e 01 00 00 ff 00 00 ff 00 00 80 66 66 06 01 01 
^^
This shows that PECI Agent 0 is supposed to be enabled.

> Bank 2:
> 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 
> 00: 8c 32 00 01 01 33 01 3c ff 33 ff ff 00 ff ff ff 
  ^^
This value suggests that the second temperature sensor (the one creating
the alarm) is supposed to be the PECI source (which reports the CPU temperature
to the NCT6776), and that it is supposed to be used to control the speed
of the CPU fan.
    ^^
Fan control is in manual mode. Did you set this ?
It is quite unusual.

Setting:    
Current FAN Mode is Optimal.
Set Fan to Standard Speed
Set Fan to Full Speed
Set Fan to Optimal Speed

> Bank 7:
> 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 
> 00: ff 95 02 10 00 00 00 00 00 64 00 00 00 00 00 00 
> 01: 00 00 00 00 00 00 00 f8 80 f8 80 f8 80 f8 80 00 
   ^^
00 here shows that the PECI source is not active, ie the CPU does not deliver
PECI data to the NCT6776. This explains the alarm.

Practical impact is probably limited as fan control is configured to be manual
anyway, but I wonder why PECI doesn't work on your board. PECI configuration
is identical to my Supermicro board.
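
The column markers (^^) in the quoted dump lost their alignment in the
archive, so it is not clear from the text alone which bytes are meant. To pick
individual bytes out of such a dump, a small Python sketch can help. It
assumes each row label is the high nibble of the register offset, so that
offset = row * 16 + column; the function name is illustrative, and the meaning
of each register is not asserted here.

```python
def parse_bank(text):
    """Parse one bank of a superiotool-style hex dump into {offset: value}."""
    regs = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()  # drop mail quoting
        head, sep, rest = line.partition(":")
        if not sep:
            continue  # column header row, no "NN:" label
        try:
            row = int(head, 16)
            values = [int(byte, 16) for byte in rest.split()]
        except ValueError:
            continue  # e.g. the "Bank 2:" title line
        for col, value in enumerate(values):
            regs[row * 16 + col] = value
    return regs


# Bank 2, row 00, as quoted above.
BANK2 = """\
> Bank 2:
> 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
> 00: 8c 32 00 01 01 33 01 3c ff 33 ff ff 00 ff ff ff
"""

regs = parse_bank(BANK2)
print(f"offset 0x00 = 0x{regs[0x00]:02x}")
print(f"offset 0x02 = 0x{regs[0x02]:02x}")
```

Mapping an offset to its function (temperature source, fan mode and so on)
needs the NCT6776F datasheet or the nct6775 driver source.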

Guenter




3.9.8: cannot boot with iommu=on using 6gbps highpoint sata card

2013-06-28 Thread Justin Piszcz
Hello,

This bug I reported in November 2012:
http://www.kernelhub.org/?p=2&msg=170797
http://home.comcast.net/~jpiszcz/20121128/highpoint.jpg

Some discussion on the patches:
http://www.kernelhub.org/?p=2&msg=172597
http://www.kernelhub.org/?p=2&msg=171846
http://www.kernelhub.org/?p=2&msg=172346

My current choices:
1. Stick with old kernel (3.7.x) + patch, which works w/IOMMU=on and use
6gbps SATA board.
2. Put drive back on motherboard SATA2 and the machine will boot properly
with any kernel.
3. I would prefer to use 6gbps if possible.

There is still a problem with 3.9.8: (below)

Rebooting to new kernel (3.9.8) from the 3.7.x kernel w/ patch provided
earlier:

Jun 28 07:48:54 p34 [ 5809.227191] EXT4-fs (sdc2): re-mounted. Opts: (null) 
Jun 28 07:48:56 p34 [ 5811.283882] sd 7:0:0:0: [sdc] Synchronizing SCSI cache
Jun 28 07:48:56 p34 [ 5811.284402] sd 0:0:1:0: [sdb] Synchronizing SCSI cache
Jun 28 07:48:56 p34 [ 5811.284522] sd 0:0:0:0: [sda] Synchronizing SCSI cache
Jun 28 07:48:56 p34 [ 5811.284911] 3w-sas: Shutting down host 0.
Jun 28 07:49:03 p34 [ 5818.107989] 3w-sas: Shutdown complete.
Jun 28 07:50:50 p34 [6.127384] usb 6-2.1: new full-speed USB device number 4 using uhci_hcd
Jun 28 07:50:50 p34 [6.248468] usb 6-2.1: New USB device found, idVendor=413c, idProduct=2002
Jun 28 07:50:50 p34 [6.249011] usb 6-2.1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Jun 28 07:50:50 p34 [6.250654] usb 6-2.1: Product: Dell USB Keyboard Hub
Jun 28 07:50:50 p34 [6.252476] usb 6-2.1: Manufacturer: Dell
Jun 28 07:50:50 p34 [6.264087] input: Dell Dell USB Keyboard Hub as /devices/pci:00/:00:1d.0/usb6/6-2/6-2.1/6-2.1:1.0/input/input5
Jun 28 07:50:50 p34 [6.265041] hid-generic 0003:413C:2002.0004: input,hidraw3: USB HID v1.10 Keyboard [Dell Dell USB Keyboard Hub] on usb-:00:1d.0-2.1/input0
Jun 28 07:50:50 p34 [6.274732] input: Dell Dell USB Keyboard Hub as /devices/pci:00/:00:1d.0/usb6/6-2/6-2.1/6-2.1:1.1/input/input6
Jun 28 07:50:50 p34 [6.275752] hid-generic 0003:413C:2002.0005: input,hidraw4: USB HID v1.10 Device [Dell Dell USB Keyboard Hub] on usb-:00:1d.0-2.1/input1
Jun 28 07:50:50 p34 [6.343561] usb 6-2.3: new low-speed USB device number 5 using uhci_hcd
Jun 28 07:50:50 p34 [6.476670] usb 6-2.3: New USB device found, idVendor=045e, idProduct=0040
Jun 28 07:50:50 p34 [6.477220] usb 6-2.3: New USB device strings: Mfr=1, Product=3, SerialNumber=0
Jun 28 07:50:50 p34 [6.479064] usb 6-2.3: Product: Microsoft 3-Button Mouse with IntelliEye(TM)
Jun 28 07:50:50 p34 [6.480944] usb 6-2.3: Manufacturer: Microsoft
Jun 28 07:50:50 p34 [6.500368] input: Microsoft Microsoft 3-Button Mouse with IntelliEye(TM) as /devices/pci:00/:00:1d.0/usb6/6-2/6-2.3/6-2.3:1.0/input/input7
Jun 28 07:50:50 p34 [6.501539] hid-generic 0003:045E:0040.0006: input,hidraw5: USB HID v1.10 Mouse [Microsoft Microsoft 3-Button Mouse with IntelliEye(TM)] on usb-:00:1d.0-2.3/input0
Jun 28 07:50:52 p34 [7.840750] ata14.00: qc timeout (cmd 0xa1)
Jun 28 07:50:52 p34 [7.848753] ata7.00: qc timeout (cmd 0xec)
Jun 28 07:50:52 p34 [7.849304] ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun 28 07:50:52 p34 [8.156029] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun 28 07:50:52 p34 [8.341165] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun 28 07:50:53 p34 [9.146828] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jun 28 07:51:02 p34 [   18.164175] ata7.00: qc timeout (cmd 0xec)
Jun 28 07:51:02 p34 [   18.164740] ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun 28 07:51:02 p34 [   18.166487] ata7: limiting SATA link speed to 3.0 Gbps
Jun 28 07:51:02 p34 [   18.473437] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Jun 28 07:51:03 p34 [   19.154971] ata14.00: qc timeout (cmd 0xa1)
Jun 28 07:51:04 p34 [   19.655389] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun 28 07:51:04 p34 [   19.656004] ata14: limiting SATA link speed to 1.5 Gbps
Jun 28 07:51:04 p34 [   20.463046] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jun 28 07:51:32 p34 [   48.497977] ata7.00: qc timeout (cmd 0xec)
Jun 28 07:51:32 p34 [   48.498547] ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun 28 07:51:33 p34 [   48.805177] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Jun 28 07:51:34 p34 [   50.487585] ata14.00: qc timeout (cmd 0xa1)
Jun 28 07:51:35 p34 [   50.987948] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun 28 07:51:36 p34 [   51.793613] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jun 28 07:51:36 p34 [   52.294202] VFS: Cannot open root device "822" or unknown-block(8,34): error -6
Jun 28 07:51:36 p34 [   52.294770] Please append a correct "root=" boot option; here are the available partitions:
Jun 28 07:51:36 p34 [   52.296927] 0800  29296762880 sda
Jun 28 07:51:36 p34  driver: sd
Jun 28 07:5
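
The SStatus and SControl values in the ata7/ata14 lines can be decoded by
hand. The Python sketch below assumes the values are printed in hexadecimal
and follow the SATA register layout (DET in bits 3:0, SPD in bits 7:4, IPM in
bits 11:8); the helper names are made up for illustration.

```python
# SPD field encoding (assumed from the SATA specification).
SPEEDS = {0: "no limit / no link", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_sstatus(hex_text):
    """Split an SStatus value, as printed in the log, into its fields."""
    value = int(hex_text, 16)
    return {
        "det": value & 0xF,         # device detection state
        "spd": (value >> 4) & 0xF,  # negotiated speed generation
        "ipm": (value >> 8) & 0xF,  # interface power management state
    }


def scontrol_speed_limit(hex_text):
    """Return the SPD (speed limit) field of an SControl value."""
    return (int(hex_text, 16) >> 4) & 0xF


for sstatus, scontrol in [("133", "300"), ("133", "320"), ("113", "310")]:
    link = decode_sstatus(sstatus)
    limit = scontrol_speed_limit(scontrol)
    print(
        f"SStatus {sstatus}: link at {SPEEDS[link['spd']]}; "
        f"SControl {scontrol}: limit = {SPEEDS[limit]}"
    )
```

Read this way, the ata7 lines show SControl asking for a 3.0 Gbps limit (320)
while SStatus still reports a 6.0 Gbps link (133), which matches the
"limiting SATA link speed to 3.0 Gbps" message being followed by
"SATA link up 6.0 Gbps".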

Re: 3.10: discard/trim support on md-raid1?

2013-07-16 Thread Justin Piszcz
Thanks for the replies,

After some further testing:
When I ran a repair via the md's sync_action, I/O to the RAID-1 dropped to
14kb/s or even less once it hit a certain number of blocks, and this
effectively locked up the system every time.
It turned out to be a bad SSD (it also failed Intel's Secure Erase), so I RMA'd it.
It is interesting, though, that it did not drop out of the array but froze the
system (the failure scenario was odd).

Justin.



On Tue, Jul 16, 2013 at 3:15 AM, NeilBrown  wrote:
> On Sat, 13 Jul 2013 06:34:19 -0400 "Justin Piszcz" 
> wrote:
>
>> Hello,
>>
>> Running 3.10 and I see the following for an md-raid1 of two SSDs:
>>
>> Checking /sys/block/md1/queue:
>> add_random: 0
>> discard_granularity: 512
>> discard_max_bytes: 2147450880
>> discard_zeroes_data: 0
>> hw_sector_size: 512
>> iostats: 0
>> logical_block_size: 512
>> max_hw_sectors_kb: 32767
>> max_integrity_segments: 0
>> max_sectors_kb: 512
>> max_segment_size: 65536
>> max_segments: 168
>> minimum_io_size: 512
>> nomerges: 0
>> nr_requests: 128
>> optimal_io_size: 0
>> physical_block_size: 512
>> read_ahead_kb: 8192
>> rotational: 1
>> rq_affinity: 0
>> scheduler: none
>> write_same_max_bytes: 0
>>
>> What should be seen:
>> rotational: 0
>
> What has "rotational" got to do with "supports discard"?
> There may be some correlation, but it isn't causal.
>
>> And possibly:
>> discard_zeroes_data: 1
>
> This should be set as the 'or' of the same value from component devices.  And
> does not enable or disable the use of discard.
>
> I don't think that "does this device support discard" appears in sysfs.
>
> I believe trim does work on md/raid1 if the underlying devices all support it.
>
> NeilBrown
>
>
>
>>
>> Can anyone confirm if there is a workaround to allow TRIM when using
>> md-raid1?
>>
>> Some related discussion here:
>> http://us.generation-nt.com/answer/md-rotational-attribute-help-206571222.html
>> http://www.progtown.com/topic343938-ssd-strange-itself-conducts.html
>>
>>
>> Justin.
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
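
For reference, the two discard-related attributes listed in the original post
can be read programmatically. The Python sketch below assumes, based on my
reading of the kernel's block queue sysfs documentation, that a zero
discard_granularity or discard_max_bytes means the queue does not advertise
discard. It only reports what the queue advertises; it does not show whether
TRIM actually reaches the component SSDs. The function name is illustrative.

```python
import os


def discard_info(queue_dir):
    """Summarise the discard attributes found in a block queue directory."""

    def read_int(name):
        with open(os.path.join(queue_dir, name)) as fh:
            return int(fh.read().strip())

    granularity = read_int("discard_granularity")
    max_bytes = read_int("discard_max_bytes")
    return {
        "granularity": granularity,
        "max_bytes": max_bytes,
        # Assumption: non-zero values mean the queue advertises discard.
        "advertises_discard": granularity > 0 and max_bytes > 0,
    }


# Example call (path taken from the post):
#   discard_info("/sys/block/md1/queue")
```

With the values quoted for md1 (discard_granularity 512, discard_max_bytes
2147450880), this would report that discard is advertised.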


RE: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)

2013-07-29 Thread Justin Piszcz


-Original Message-
From: NeilBrown [mailto:ne...@suse.de] 
Sent: Monday, July 29, 2013 1:57 AM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; linux-r...@vger.kernel.org
Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x
SSD)

On Fri, 26 Jul 2013 05:56:51 -0400 "Justin Piszcz" 
wrote:

[..]

Further testing shows all is ok now:


Sun Nov 25 02:12:03 EST 2012: Parity check(s) running, sleeping 60 seconds...
Sun Nov 25 02:13:03 EST 2012: Parity check(s) running, sleeping 60 seconds...
Sun Nov 25 02:14:03 EST 2012: cat /sys/block/md0/md/mismatch_cnt
Sun Nov 25 02:14:03 EST 2012: 0
Sun Nov 25 02:14:03 EST 2012: cat /sys/block/md1/md/mismatch_cnt
Sun Nov 25 02:14:03 EST 2012: 0
Sun Nov 25 02:14:03 EST 2012: The meta-device /dev/md0 has no mismatched sectors.
Sun Nov 25 02:14:04 EST 2012: The meta-device /dev/md1 has no mismatched sectors.
Sun Nov 25 02:14:05 EST 2012: All devices are clean...
Sun Nov 25 02:14:05 EST 2012: cat /sys/block/md0/md/mismatch_cnt
Sun Nov 25 02:14:05 EST 2012: 0
Sun Nov 25 02:14:05 EST 2012: cat /sys/block/md1/md/mismatch_cnt
Sun Nov 25 02:14:05 EST 2012: 0

Thanks for your help.

Justin.



3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)

2013-07-21 Thread Justin Piszcz
Hi,

When I run repair on an MD-RAID1 sync_action, the speed slows down and it
stays like this (below) for hours.  

The system is then completely unresponsive to user input.  I have replaced a
failing SSD; however, after a check, mismatch_cnt seems to increase over
time.  When I run repair, the system stops responding to user input.  Has
anyone else run into this issue with a RAID-1 volume (2 x SSD) using 0.90
metadata?  Long ago I used this same configuration with two physical disks and
there was never a problem.

Even though I left a root shell open, writing idle to sync_action has no
effect and does not stop the resync:
# echo idle > /sys/devices/virtual/block/md1/md/sync_action

Every 1.0s: cat /proc/mdstat        Sun Jul 21 06:15:38 2013

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
  233381376 blocks [2/2] [UU]
  [>]  resync =  0.0% (151616/233381376) finish=36171.5min speed=107K/sec

md0 : active raid1 sdc1[0] sdb1[1]
  1048512 blocks [2/2] [UU]

unused devices: <none>

10 minutes later:

  233381376 blocks [2/2] [UU]
  [>]  resync =  0.0% (151616/233381376) finish=52219.3min speed=74K/sec

The block where it hangs (151616 here) has been different each time I watched
it; it does not appear to hang at the same block each time.
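
To put the numbers in perspective, the finish estimate can be recomputed from
the progress line. The Python sketch below assumes the counts are in 1 KiB
blocks and the speed is in KiB per second; the regular expression and function
names are illustrative, written against the output pasted above.

```python
import re

# Matches the mdstat progress fields, on one line or wrapped across two.
PROGRESS = re.compile(
    r"(resync|check|recovery|reshape)\s*=\s*([\d.]+)%\s*"
    r"\((\d+)/(\d+)\)\s*finish=([\d.]+)min\s+speed=(\d+)K/sec"
)


def parse_progress(text):
    """Extract the progress fields from /proc/mdstat text, or return None."""
    m = PROGRESS.search(text)
    if not m:
        return None
    action, percent, done, total, finish_min, speed_k = m.groups()
    return {
        "action": action,
        "percent": float(percent),
        "done": int(done),
        "total": int(total),
        "finish_min": float(finish_min),
        "speed_k": int(speed_k),
    }


def eta_minutes(done, total, speed_k):
    """Minutes left if the remaining blocks are processed at speed_k KiB/s."""
    return (total - done) / speed_k / 60


SAMPLE = "[>]  resync =  0.0% (151616/233381376)\nfinish=36171.5min speed=107K/sec"
p = parse_progress(SAMPLE)
eta = eta_minutes(p["done"], p["total"], p["speed_k"])
print(f"{p['action']}: about {eta:.0f} min left ({eta / 60 / 24:.1f} days)")
```

At 107K/sec the remaining 233 million blocks work out to roughly 25 days,
which is in line with the finish value md printed.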

Justin.



RE: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)

2013-07-25 Thread Justin Piszcz


-Original Message-
From: NeilBrown [mailto:ne...@suse.de] 
Sent: Sunday, July 21, 2013 7:03 PM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; linux-r...@vger.kernel.org
Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x
SSD)

> Hi Justin,
>  this is a known bug.  Fix has been accepted into mainline for 3.11-rc2.
>  Hopefully it will get into 3.10.3 (too late for 3.10.2).

> NeilBrown


Hi Neil,

Did the fix by chance make it into 3.10.3?

The same issue occurs with 3.10.3 for me as well:

Every 1.0s: cat /proc/mdstat        Thu Jul 25 19:09:46 2013

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
  233381376 blocks [2/2] [UU]
  [>]  resync =  0.0% (151488/233381376) finish=32045.3min speed=121K/sec

md0 : active raid1 sdc1[0] sdb1[1]
  1048512 blocks [2/2] [UU]

unused devices: <none>



Justin.



RE: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)

2013-07-26 Thread Justin Piszcz


-Original Message-
From: NeilBrown [mailto:ne...@suse.de] 
Sent: Thursday, July 25, 2013 8:36 PM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; linux-r...@vger.kernel.org
Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x
SSD)

On Thu, 25 Jul 2013 19:10:50 -0400 "Justin Piszcz" 
wrote:

> Did the fix by chance make it into 3.10.3?

No, it looks like it missed again.  I gather there was a large inflow of
patches for -stable in the 3.11-rc1 merge window and Greg has been processing
them in batches.  Hopefully in 3.10.4.

The relevant patch is commit 30bc9b53878a9921b02e3 in mainline.

NeilBrown

--

Method to get patch via git and patch kernel:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git log |grep 30bc9b53878a9921b02e3
commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
$ git show 30bc9b53878a9921b02e3b5bc4283ac1c6de102a > /tmp/a
# patch -p1 < /tmp/a
patching file drivers/md/raid1.c
Hunk #1 succeeded at 1848 (offset -1 lines).
Hunk #2 succeeded at 1886 (offset -1 lines).
Hunk #3 succeeded at 1915 (offset -1 lines).

Rebooted, tested, success, thanks!

One follow-up question:
$ cat /sys/block/md1/md/mismatch_cnt
314112
-> On a live RAID-1 (root filesystem) without swap, is it normal to have
such a high mismatch_cnt even after a repair?

First repair:
Fri Jul 26 05:30:47 EDT 2013: The meta-device /dev/md1 has mismatch_cnt 314112 sectors.
Second repair:
Fri Jul 26 05:30:47 EDT 2013: The meta-device /dev/md1 has mismatch_cnt 313600 sectors.

Should I be concerned?
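
For scale, the mismatch count can be converted to a size and a share of the
array. The Python sketch below assumes mismatch_cnt is in 512-byte sectors and
that the /proc/mdstat block count (233381376) is in 1 KiB blocks. As far as I
know md counts mismatches in units larger than a single sector, so the result
is best read as an upper bound.

```python
SECTOR_BYTES = 512          # assumed unit of mismatch_cnt
MDSTAT_BLOCK_BYTES = 1024   # assumed unit of the mdstat "blocks" figure

mismatch_sectors = 314112   # value reported after the first repair
array_blocks = 233381376    # md1 size from /proc/mdstat

mismatch_bytes = mismatch_sectors * SECTOR_BYTES
array_bytes = array_blocks * MDSTAT_BLOCK_BYTES
share = mismatch_bytes / array_bytes

print(f"mismatched: {mismatch_bytes / 2**20:.1f} MiB")
print(f"share of array: {share * 100:.3f}%")
```

Under those assumptions the figure corresponds to about 153 MiB, or well under
a tenth of a percent of the array.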


Testing the patch:

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
  233381376 blocks [2/2] [UU]
  [>]  check =  0.3% (838976/233381376) finish=9.2min speed=419488K/sec

md0 : active raid1 sdc1[0] sdb1[1]
  1048512 blocks [2/2] [UU]

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
  233381376 blocks [2/2] [UU]
  [===>.]  check = 77.5% (180889856/233381376) finish=2.5min speed=342654K/sec

md0 : active raid1 sdc1[0] sdb1[1]
  1048512 blocks [2/2] [UU]

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
  233381376 blocks [2/2] [UU]

md0 : active raid1 sdc1[0] sdb1[1]
  1048512 blocks [2/2] [UU]


Justin.




RE: [PATCH] Quirk to support Marvell 88SE91xx SATA controllers with Intel IOMMU.

2013-03-01 Thread Justin Piszcz


-Original Message-
From: Andrew Cooks [mailto:aco...@gmail.com] 
Sent: Friday, March 01, 2013 3:26 AM
To: aco...@gmail.com; j...@8bytes.org; xjtuy...@hotmail.com;
gm.y...@gmail.com; alex.william...@redhat.com; bhelg...@google.com;
jpis...@lucidpixels.com; dw...@infradead.org
Cc: open list:INTEL IOMMU (VT-d); open list; open list:PCI SUBSYSTEM
Subject: [PATCH] Quirk to support Marvell 88SE91xx SATA controllers with
Intel IOMMU.

This is my third submitted patch to make Marvell 88SE91xx SATA controllers
work when IOMMU is enabled.[1][2]

What's changed:
* Adopt David Woodhouse's terminology by referring to the quirky functions
as 'ghost' functions.
* Unmap ghost functions when device is detached from IOMMU.
* Stub function for when CONFIG_PCI_QUIRKS is not enabled.

The bad:
* Still no AMD support.
* The table of affected chip IDs is as complete as I can make it by googling
for bug reports.

This patch was generated against commit
b0af9cd9aab60ceb17d3ebabb9fdf4ff0a99cf50, but will also apply cleanly to
3.7.10.

--

Hi,

Against 3.7.10:

# patch -p1 < ../RFC-Fix-Intel-IOMMU-support-for-Marvell-88SE91xx-SATA-controllers..patch
patching file drivers/iommu/intel-iommu.c
patching file drivers/pci/quirks.c
Hunk #1 succeeded at 3230 (offset 3 lines).
patching file include/linux/pci.h
#

Recompile kernel, reboot..

Shutdown host, re-attach to Marvell Controller w/IOMMU.

The host still failed to boot, dmesg/panic here:
http://home.comcast.net/~jpiszcz/20130301/boot_failure.JPG

(The root disk is /dev/sdc)

I recompiled again with IOMMU off and it booted ok:

# uname -a
Linux host 3.7.10 #2 SMP Fri Mar 1 12:44:25 EST 2013 x86_64 GNU/Linux

Here is the part of dmesg (what it looks like when it succeeds with
IOMMU=off)

[4.288113] input: American Megatrends Inc. Virtual Keyboard and Mouse as /devices/pci:00/:00:1a.1/usb4/4-2/4-2:1.0/input/input3
[4.289025] hid-generic 0003:046B:FF10.0001: input,hidraw0: USB HID v1.10 Keyboard [American Megatrends Inc. Virtual Keyboard and Mouse] on usb-:00:1a.1-2/input0
[4.305993] input: American Megatrends Inc. Virtual Keyboard and Mouse as /devices/pci:00/:00:1a.1/usb4/4-2/4-2:1.1/input/input4
[4.307106] hid-generic 0003:046B:FF10.0002: input,hidraw1: USB HID v1.10 Mouse [American Megatrends Inc. Virtual Keyboard and Mouse] on usb-:00:1a.1-2/input1
[4.326481] ata6: SATA link down (SStatus 0 SControl 300)
[4.327324] scsi 7:0:0:0: Direct-Access ATA  INTEL SSDSC2MH25 PWG4 PQ: 0 ANSI: 5
[4.329953] sd 7:0:0:0: [sdc] 488397168 512-byte logical blocks: (250 GB/232 GiB)
[4.330639] scsi 14:0:0:0: Processor Marvell  91xx Config 1.01 PQ: 0 ANSI: 5
[4.333276] sd 7:0:0:0: [sdc] Write Protect is off
[4.334746] sd 7:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[4.334921] sd 7:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[4.345622]  sdc: sdc1 sdc2
[4.347493] sd 7:0:0:0: [sdc] Attached SCSI disk

Justin.
 




RE: [PATCH] Quirk to support Marvell 88SE91xx SATA controllers with Intel IOMMU.

2013-03-01 Thread Justin Piszcz


-Original Message-
From: Andrew Cooks [mailto:aco...@gmail.com] 
Sent: Friday, March 01, 2013 5:19 PM
To: Justin Piszcz
Cc: Joerg Roedel; YingChu; Chu Ying; Alex Williamson; bhelg...@google.com;
David Woodhouse; open list:INTEL IOMMU (VT-d); open list; open list:PCI
SUBSYSTEM
Subject: Re: [PATCH] Quirk to support Marvell 88SE91xx SATA controllers with
Intel IOMMU.

On Sat, Mar 2, 2013 at 1:51 AM, Justin Piszcz 
wrote:
>
>

> Thanks for testing!
No problem.

Against a clean 3.7.10 (from ftp.kernel.org)

# patch -p1 <
../patch/RFC-Fix-Intel-IOMMU-support-for-Marvell-88SE91xx-SATA-controllers..
patch 
patching file drivers/iommu/intel-iommu.c
patching file drivers/pci/quirks.c
Hunk #1 succeeded at 3230 (offset 3 lines).
patching file include/linux/pci.h
# pwd
/usr/src/linux-3.7.10

Full dmesg with the patch applied: (but with IOMMU off)
http://home.comcast.net/~jpiszcz/20130301/dmesg-full.txt

Full dmesg (as much as possible through netconsole with IOMMU on)
http://home.comcast.net/~jpiszcz/20130301/dmesg-iommu-on.txt

Let me know if anything else is needed, thanks.

Justin.




RE: [PATCH] Quirk to support Marvell 88SE91xx SATA controllers with Intel IOMMU.

2013-03-04 Thread Justin Piszcz


On Sat, Mar 2, 2013 at 7:18 AM, Justin Piszcz 
wrote:
>
> Against a clean 3.7.10 (from ftp.kernel.org)
>
> # patch -p1 <
>
../patch/RFC-Fix-Intel-IOMMU-support-for-Marvell-88SE91xx-SATA-controllers..
> patch
> patching file drivers/iommu/intel-iommu.c
> patching file drivers/pci/quirks.c
> Hunk #1 succeeded at 3230 (offset 3 lines).
> patching file include/linux/pci.h
> # pwd
> /usr/src/linux-3.7.10
>

I've downloaded and patched the 3.7.10 tarball and still get the same
output I got before; different output from yours. I'm not sure the
patch is complete or applying correctly, are you?
Could you please check whether the patch you're applying is the same
as the attached file?

Hi,

Success!

Patch from e-mail:
# md5sum marvell_ghost_funcs.patch 
718bfb5876e3538ec23a516ef28d03f5  marvell_ghost_funcs.patch

Kernel from ftp.kernel.org:
# md5sum linux-3.7.10.tar.bz2
56ec294a922b6112a1ef129668f38a83  linux-3.7.10.tar.bz2
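
To compare a local file against a checksum that someone else has posted, the
check can be scripted. This is only a minimal sketch, assuming GNU coreutils
md5sum is installed; verify_md5 is a made-up helper name, not an existing tool.

```shell
#!/bin/sh
# verify_md5 FILE EXPECTED_MD5
# Returns 0 when FILE's MD5 digest equals EXPECTED_MD5, non-zero otherwise.
# (Hypothetical helper for illustration; relies on GNU coreutils md5sum.)
verify_md5() {
    file=$1
    expected=$2
    # md5sum prints "<digest>  <filename>"; keep only the digest field.
    actual=$(md5sum "$file" | cut -d' ' -f1)
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

With the sums listed above, this would be called as:
verify_md5 marvell_ghost_funcs.patch 718bfb5876e3538ec23a516ef28d03f5 && echo same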

Decompress, patch, re-compile w/IOMMU=on.

# tar jxf linux-3.7.10.tar.bz2 ; ln -s linux-3.7.10 linux   
# cd linux; patch -p1 < ../marvell_ghost_funcs.patch
patching file drivers/iommu/intel-iommu.c
Hunk #1 succeeded at 1672 (offset -2 lines).
Hunk #2 succeeded at 1729 (offset -2 lines).
Hunk #3 succeeded at 3833 (offset -2 lines).
patching file drivers/pci/quirks.c
Hunk #1 succeeded at 3210 (offset -39 lines).
Hunk #2 succeeded at 3240 (offset -39 lines).
Hunk #3 succeeded at 3258 (offset -39 lines).
patching file include/linux/pci.h
Hunk #1 succeeded at 1546 (offset -32 lines).
Hunk #2 succeeded at 1555 (offset -32 lines).
patching file include/linux/pci_ids.h

Reboot, re-test.

# lilo
Added 3.7.7-1
Added 3.7.10-5-ioff
Added 3.7.10-7(iommu=off w/patch) = OK
Added 3.7.10-8  * (iommu=on w/patch) = OK

dmesg w/patch + iommu
http://home.comcast.net/~jpiszcz/20130304/dmesg-success-patch.txt

Thanks!

Justin.




Re: [PATCH] Quirk to support Marvell 88SE91xx SATA controllers with Intel IOMMU.

2013-03-29 Thread Justin Piszcz
On Sun, Mar 3, 2013 at 8:35 PM, Andrew Cooks  wrote:
> On Sat, Mar 2, 2013 at 7:18 AM, Justin Piszcz  wrote:
>>
>> Against a clean 3.7.10 (from ftp.kernel.org)
>>
>> # patch -p1 <
>> ../patch/RFC-Fix-Intel-IOMMU-support-for-Marvell-88SE91xx-SATA-controllers..
>> patch
>> patching file drivers/iommu/intel-iommu.c
>> patching file drivers/pci/quirks.c
>> Hunk #1 succeeded at 3230 (offset 3 lines).
>> patching file include/linux/pci.h
>> # pwd
>> /usr/src/linux-3.7.10
>>
>
> I've downloaded and patched the 3.7.10 tarball and still get the same
> output I got before; different output from yours. I'm not sure the
> patch is complete or applying correctly, are you?
> Could you please check whether the patch you're applying is the same

Hi,

As this patch has now been working for some time (against 3.7.x), I was
wondering when it is going to be included in mainline.

I upgraded to 3.8.x and rebooted, the same problem recurred, and I
had to revert to 3.7.x.

Thanks,

Justin.


3.5.1 kernel: Oops + stracktrace + ext4 kernel errors!

2012-08-24 Thread Justin Piszcz
Hello,

Thoughts?

Saw this when trying to copy files to array with Samba and doing file
operations:

[28939.505792] [ cut here ]
[28939.505818] WARNING: at include/linux/iocontext.h:140
copy_process.part.50+0x115e/0x1220()
[28939.505826] Hardware name: X8DTH-i/6/iF/6F
[28939.505833] Pid: 16976, comm: dump Not tainted 3.5.1+ #3
[28939.505838] Call Trace:
[28939.505847]  [] warn_slowpath_common+0x75/0xb0
[28939.505855]  [] warn_slowpath_null+0x15/0x20
[28939.505862]  [] copy_process.part.50+0x115e/0x1220
[28939.505869]  [] do_fork+0x13b/0x2f0
[28939.505880]  [] ? recalc_sigpending+0x12/0x30
[28939.505888]  [] ? __set_current_blocked+0x3a/0x60
[28939.505898]  [] sys_clone+0x23/0x30
[28939.505908]  [] stub_clone+0x13/0x20
[28939.505916]  [] ? system_call_fastpath+0x1a/0x1f
[28939.505922] ---[ end trace bb4eebc57a10f73a ]---

[29113.279716] 3w-sas: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=0.
[29367.345433] BUG: unable to handle kernel NULL pointer dereference
at 0028
[29367.345455] IP: [] ext4_ext_remove_space+0x89c/0xc90
[29367.345471] PGD c1ef31067 PUD aa4435067 PMD 0
[29367.345485] Oops:  [#1] SMP
[29367.345495] CPU 4
[29367.345503] Pid: 16922, comm: rsync Tainted: GW3.5.1+
#3 Supermicro X8DTH-i/6/iF/6F/X8DTH
[29367.345520] RIP: 0010:[]  []
ext4_ext_remove_space+0x89c/0xc90
[29367.345534] RSP: 0018:880a7db79c98  EFLAGS: 00010246
[29367.345542] RAX:  RBX: 0002 RCX: 0003c06c3600
[29367.345550] RDX: 0001 RSI: 0001f4b88bf3 RDI: 0002
[29367.345558] RBP: 880a7db79d88 R08: c06c3600 R09: 8806245245c0
[29367.345566] R10:  R11:  R12: 0001
[29367.345574] R13: 8806245245f0 R14: 88029948b0cc R15: 8800b53596f0
[29367.345582] FS:  7f8b5c30e700() GS:88063fc8()
knlGS:
[29367.345593] CS:  0010 DS:  ES:  CR0: 8005003b
[29367.345598] CR2: 0028 CR3: 000b59a6c000 CR4: 07e0
[29367.345604] DR0:  DR1:  DR2: 
[29367.345609] DR3:  DR6: 0ff0 DR7: 0400
[29367.345615] Process rsync (pid: 16922, threadinfo 880a7db78000,
task 8800bb318cf0)
[29367.345621] Stack:
[29367.345624]  880a7db79cd8 8116821b 880a7db79ce8
8800b53596f0
[29367.345638]  8808840eb600 880a7db79d40 88062554a000
8800fff5
[29367.345651]  880a7db79d28 8800b5359660 88062554a000
880624524620
[29367.345664] Call Trace:
[29367.345671]  [] ? __ext4_handle_dirty_metadata+0x7b/0x100
[29367.345678]  [] ext4_ext_truncate+0x173/0x1b0
[29367.345685]  [] ? ext4_mark_inode_dirty+0x66/0x170
[29367.345693]  [] ext4_truncate+0x5d/0x70
[29367.345699]  [] ext4_evict_inode+0x378/0x3d0
[29367.345707]  [] evict+0xaa/0x1a0
[29367.345713]  [] iput+0x103/0x210
[29367.345720]  [] do_unlinkat+0x154/0x1c0
[29367.345729]  [] ? vfs_write+0x118/0x160
[29367.345739]  [] ? sys_write+0x45/0xa0
[29367.345745]  [] sys_unlink+0x11/0x20
[29367.345753]  [] system_call_fastpath+0x1a/0x1f
[29367.345759] Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49
89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 c5 f8 ff ff 0f 1f 00
49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 9c f8 ff ff 0f 1f 80 00 00 00
00 41
[29367.345874] RIP  [] ext4_ext_remove_space+0x89c/0xc90
[29367.345881]  RSP 
[29367.345885] CR2: 0028
[29367.345890] ---[ end trace bb4eebc57a10f73b ]---
[35775.632435] 3w-sas: scsi0: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
[39395.965177] 3w-sas: scsi0: AEN: INFO (0x04:0x0091): Unit now in
standby mode:unit=0.
[50143.132858] 3w-sas: scsi0: AEN: INFO (0x04:0x0090): Unit now in
active mode:unit=0.

Justin.


Re: 3.5.1 kernel: Oops + stracktrace + ext4 kernel errors!

2012-08-24 Thread Justin Piszcz
On Fri, Aug 24, 2012 at 11:31 AM, Justin Piszcz  wrote:
> Hello,
>
> Thoughts?

..

Going back to XFS.
EXT4 appears unstable after more than 16TB is on the array (of 60TB ext4fs).

Justin.


RE: 3.5.1 kernel: Oops + stracktrace + ext4 kernel errors!

2012-08-24 Thread Justin Piszcz


-Original Message-
From: Theodore Ts'o [mailto:ty...@mit.edu] 
Sent: Friday, August 24, 2012 6:39 PM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; linux-e...@vger.kernel.org; al piszcz
Subject: Re: 3.5.1 kernel: Oops + stracktrace + ext4 kernel errors!

On Fri, Aug 24, 2012 at 11:31:44AM -0400, Justin Piszcz wrote:
> Hello,
> 
> Thoughts?
> 
> Saw this when trying to copy files to array with Samba and doing file
> operations:
> 
> [28939.505792] [ cut here ]
> [29367.345433] BUG: unable to handle kernel NULL pointer dereference
> at 0028
> [29367.345455] IP: [] ext4_ext_remove_space+0x89c/0xc90

Fixed by commit 89a4e48f84 in upstream.  It is scheduled for inclusion
in a stable kernel series; I believe it should be in 3.5.3.

Regards,

- Ted


--

Thanks.. if/when I come across another box I can test with I will ensure
that patch (89a4e48f84 ) gets applied.  For PROD hosts I need stability >
16T.

Justin.



3.5.2: moving files from xfs/disk -> nfs: radix_tree_lookup_slot+0xe/0x10

2012-08-27 Thread Justin Piszcz
Hi,

While moving ~276GB of files (mainly large backups), everything seems
to have locked up on the client that is sending data to the server; it
is still in this state..

[75716.705697] INFO: task sync:8790 blocked for more than 120 seconds.
[75716.705701] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[75716.705703] syncD 88040ec54830 0  8790   2141 0x
[75716.705708]  88001fff1d08 0086 81862fc0
88001fff1fd8
[75716.705713]  88001fff1fd8 4000 88041d958670
88040ec54830
[75716.705716]  88001fff1c38 812dcaee 88001fff1c58
88001fff1d78
[75716.705720] Call Trace:
[75716.705729]  [] ? radix_tree_lookup_slot+0xe/0x10
[75716.705733]  [] ? find_get_pages_tag+0xc6/0x150
[75716.705738]  [] ? __enqueue_entity+0x70/0x80
[75716.705742]  [] ? __sync_filesystem+0x90/0x90
[75716.705747]  [] schedule+0x24/0x70
[75716.705751]  [] schedule_timeout+0x1a9/0x210
[75716.705755]  [] ? calc_period_shift+0x60/0x60
[75716.705760]  [] ? check_preempt_curr+0x75/0xa0
[75716.705764]  [] wait_for_common+0xc0/0x150
[75716.705767]  [] ? try_to_wake_up+0x280/0x280
[75716.705770]  [] ? __sync_filesystem+0x90/0x90
[75716.705773]  [] wait_for_completion+0x18/0x20
[75716.705777]  [] writeback_inodes_sb_nr+0x77/0xa0
[75716.705782]  [] ?
shrink_dcache_for_umount_subtree+0x111/0x1d0
[75716.705785]  [] writeback_inodes_sb+0x29/0x40
[75716.705788]  [] __sync_filesystem+0x47/0x90
[75716.705791]  [] sync_one_sb+0x1b/0x20
[75716.705795]  [] iterate_supers+0xe1/0xf0
[75716.705798]  [] sys_sync+0x2b/0x60
[75716.705802]  [] system_call_fastpath+0x1a/0x1f
[75836.701197] INFO: task sync:8790 blocked for more than 120 seconds.

Thoughts?

Justin.


Upgraded from 3.4 to 3.5.1 kernel: machine does not boot

2012-08-10 Thread Justin Piszcz
Hello,

Motherboard: Supermicro X8DTH-6F
Distro: Debian Testing x86_64

From 3.4 -> 3.5.1 on x86_64: after make oldconfig and a few minor changes, the
machine attempts to boot but hangs at the filesystem-mounting stage of the
boot process.

Picture of where it stops working (a little blurry but readable)
http://home.comcast.net/~jpiszcz/20120810/3.5-kernel-hangs.jpg

Kernel config 3.4 (working)
http://home.comcast.net/~jpiszcz/20120810/config-3.4.txt

Kernel config 3.5.1 (hangs)
http://home.comcast.net/~jpiszcz/20120810/config-3.5.1.txt

As you can see towards the end, the machine had been sitting there for 1 hour,
since that is the spin-down timeout I have set for the drives on the 3ware card.

Any thoughts as to what is wrong here?

Diff between the two:

$ diff -u config-3.4.txt  config-3.5.1.txt  |grep '^+C'
+CONFIG_ARCH_SUPPORTS_UPROBES=y
+CONFIG_BUILDTIME_EXTABLE_SORT=y
+CONFIG_CLOCKSOURCE_WATCHDOG=y
+CONFIG_ARCH_CLOCKSOURCE_DATA=y
+CONFIG_GENERIC_TIME_VSYSCALL=y
+CONFIG_GENERIC_CLOCKEVENTS=y
+CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
+CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
+CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
+CONFIG_GENERIC_CMOS_UPDATE=y
+CONFIG_TICK_ONESHOT=y
+CONFIG_NO_HZ=y
+CONFIG_HIGH_RES_TIMERS=y
+CONFIG_RCU_FANOUT_LEAF=16
+CONFIG_GENERIC_SMP_IDLE_THREAD=y
+CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
+CONFIG_SECCOMP_FILTER=y
+CONFIG_CROSS_MEMORY_ATTACH=y
+CONFIG_X86_DEV_DMA_OPS=y
+CONFIG_NETFILTER_NETLINK=y
+CONFIG_NF_CT_NETLINK=y
+CONFIG_HAVE_BPF_JIT=y
+CONFIG_E1000E=y
+CONFIG_IXGBE_HWMON=y
+CONFIG_NET_VENDOR_I825XX=y
+CONFIG_HID=y
+CONFIG_HIDRAW=y
+CONFIG_HID_GENERIC=y
+CONFIG_USB_HID=y
+CONFIG_HID_PID=y
+CONFIG_USB_HIDDEV=y
+CONFIG_NEW_LEDS=y
+CONFIG_LEDS_CLASS=y
+CONFIG_NFS_V2=y
+CONFIG_PANIC_ON_OOPS_VALUE=0
+CONFIG_RCU_CPU_STALL_INFO=y
+CONFIG_CRYPTO_CRC32C=y
+CONFIG_GENERIC_STRNCPY_FROM_USER=y
+CONFIG_GENERIC_STRNLEN_USER=y

Justin.




Re: Upgraded from 3.4 to 3.5.1 kernel: machine does not boot

2012-08-10 Thread Justin Piszcz
On Fri, Aug 10, 2012 at 1:53 PM, Jesper Juhl  wrote:
> On Fri, 10 Aug 2012, Justin Piszcz wrote:
>
>> Hello,
>>
>> Motherboard: Supermicro X8DTH-6F
>> Distro: Debian Testing x86_64
>>
>> >From 3.4 -> 3.5.1 on x86_64 make oldconfig and a few minor changes and the
>> machine attempts to boot but hangs at the filesystem mounting part of the
>> boot process.

Hi,

Found the root cause, the 3.5.1 kernel cannot mount my ext4 filesystem (60TB).

The 3.4 kernel works fine.

This is proven by commenting out the filesystem in /etc/fstab with
3.5.1, and all is OK.

When I run mount for that filesystem, it hangs, I ran alt+sysrq+t to
get additional output and I have pasted it below with the 3.5.1
kernel:

[  160.373406] mount   R  running task0  4361   4355 0x
[  160.373407]  8806266bdb68 0086 8806266bdaa8
8806266bdfd8
[  160.373410]  8806266bdfd8 4000 8806270b0600
880626c73a10
[  160.373413]  00011240 880c260177c0 880c260177c0

[  160.373415] Call Trace:
[  160.373416]  [] ? __schedule+0x299/0x770
[  160.373418]  [] __cond_resched+0x25/0x40
[  160.373420]  [] _cond_resched+0x2a/0x40
[  160.373421]  [] ext4_calculate_overhead+0x239/0x3e0
[  160.373425]  [] ext4_fill_super+0x1aa9/0x2930
[  160.373427]  [] mount_bdev+0x19f/0x1e0
[  160.373429]  [] ? ext4_calculate_overhead+0x3e0/0x3e0
[  160.373431]  [] ext4_mount+0x10/0x20
[  160.373433]  [] mount_fs+0x1b/0xd0
[  160.373434]  [] vfs_kern_mount+0x6f/0x110
[  160.373437]  [] do_kern_mount+0x4f/0x100
[  160.373439]  [] do_mount+0x2fe/0x8a0
[  160.373440]  [] ? strndup_user+0x53/0x70
[  160.373442]  [] sys_mount+0x90/0xe0
[  160.373443]  [] system_call_fastpath+0x1a/0x1f
[  160.373446] jbd2/sda1-8 S 880c2675f800 0  4362  2 0x
[  160.373448]  880623ca9e50 0046 880626c73a10
880623ca9fd8
[  160.373450]  880623ca9fd8 4000 8806271b9850
880626d08250
[  160.373453]  880623ca9da0 8806266bdbe0 880c2675f8a0
880c2675f888
[  160.373455] Call Trace:
[  160.373456]  [] ? default_wake_function+0xd/0x10
[  160.373458]  [] ? autoremove_wake_function+0x11/0x40
[  160.373460]  [] ? __wake_up_common+0x55/0x90
[  160.373462]  [] schedule+0x24/0x70
[  160.373463]  [] kjournald2+0x1ce/0x1e0
[  160.373465]  [] ? abort_exclusive_wait+0xb0/0xb0
[  160.373467]  [] ? commit_timeout+0x10/0x10
[  160.373469]  [] kthread+0x8e/0xa0
[  160.373471]  [] kernel_thread_helper+0x4/0x10
[  160.373472]  [] ? kthread_flush_work_fn+0x10/0x10
[  160.373474]  [] ? gs_change+0xb/0xb

Justin.


RE: Upgraded from 3.4 to 3.5.1 kernel: machine does not boot

2012-08-10 Thread Justin Piszcz


-Original Message-
From: Justin Piszcz [mailto:jpis...@lucidpixels.com] 
Sent: Friday, August 10, 2012 5:46 PM
To: Jesper Juhl
Cc: linux-kernel@vger.kernel.org; a...@solarrain.com
Subject: Re: Upgraded from 3.4 to 3.5.1 kernel: machine does not boot

On Fri, Aug 10, 2012 at 1:53 PM, Jesper Juhl  wrote:
> On Fri, 10 Aug 2012, Justin Piszcz wrote:
>
>> Hello,
>>
>> Motherboard: Supermicro X8DTH-6F
>> Distro: Debian Testing x86_64
>>
>> >From 3.4 -> 3.5.1 on x86_64 make oldconfig and a few minor changes and
the
>> machine attempts to boot but hangs at the filesystem mounting part of the
>> boot process.

Hi,

Found the root cause, the 3.5.1 kernel cannot mount my ext4 filesystem
(60TB).

The 3.4 kernel works fine.

This is proven by commenting out the filesystem in /etc/fstab with
3.5.1, and all is OK.

--

Hi again,

I tested with linux-3.6-rc1:

The same problem, here is what I get from the strace:

irectory)
4434  readlink("/dev", 0x7fff3b05c670, 4096) = -1 EINVAL (Invalid argument)
4434  readlink("/dev/sda1", 0x7fff3b05c670, 4096) = -1 EINVAL (Invalid
argument)
4434  readlink("/r1", 0x7fff3b05c670, 4096) = -1 EINVAL (Invalid argument)
4434  getuid()  = 0
4434  geteuid() = 0
4434  getgid()  = 0
4434  getegid() = 0
4434  prctl(PR_GET_DUMPABLE)= 1
4434  lstat("/etc/mtab", {st_mode=S_IFLNK|0777, st_size=12, ...}) = 0
4434  getuid()  = 0
4434  geteuid() = 0
4434  getgid()  = 0
4434  getegid() = 0
4434  prctl(PR_GET_DUMPABLE)= 1
4434  stat("/run", {st_mode=S_IFDIR|0755, st_size=820, ...}) = 0
4434  lstat("/run/mount/utab", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
4434  open("/run/mount/utab", O_RDWR|O_CREAT, 0644) = 3
4434  close(3)  = 0
4434  mount("/dev/sda1", "/r1", "ext4", MS_MGC_VAL|MS_NOATIME, NULL

--

(w/ 3.6-rc1) 

[   89.868843] mount   R  running task0  4434   4433
0x0009
[   89.868847]  880c246b7b68 816c9279 880c246b7aa8
880c246b7fd8
[   89.868851]  880c246b7fd8 4000 88062720cdb0
880c246862d0
[   89.868855]  000116c0 880623a863c0 880623a863c0

[   89.868855] Call Trace:
[   89.868858]  [] ? __schedule+0x299/0x770
[   89.868860]  [] ? __schedule+0x299/0x770
[   89.868864]  [] ? ext4_get_group_desc+0x49/0xb0
[   89.868868]  [] ? ext4_calculate_overhead+0x131/0x3e0
[   89.868871]  [] ? ext4_fill_super+0x1a4b/0x28d0
[   89.868875]  [] ? mount_bdev+0x1a1/0x1e0
[   89.868877]  [] ? ext4_calculate_overhead+0x3e0/0x3e0
[   89.868880]  [] ? ext4_mount+0x10/0x20
[   89.868882]  [] ? mount_fs+0x1b/0xd0
[   89.868885]  [] ? vfs_kern_mount+0x6f/0x110
[   89.86]  [] ? do_kern_mount+0x4f/0x100
[   89.868890]  [] ? do_mount+0x2fe/0x8a0
[   89.868894]  [] ? strndup_user+0x53/0x70
[   89.868896]  [] ? sys_mount+0x90/0xe0
[   89.868899]  [] ? tracesys+0xd4/0xd9

Justin.





Re: Upgraded from 3.4 to 3.5.1 kernel: machine does not boot

2012-08-10 Thread Justin Piszcz
On Fri, Aug 10, 2012 at 7:07 PM, Justin Piszcz
>
> Hi,
>
> Found the root cause, the 3.5.1 kernel cannot mount my ext4 filesystem
> (60TB).
>
> The 3.4 kernel works fine.
>
> This is proven by commenting out the filesystem in /etc/fstab with
> 3.5.1, and all is OK.
>
> --
>
> Hi again,
>
> I tested with linux-3.6-rc1:
>
> The same problem, here is what I get from the strace:
>
> irectory)
> 4434  readlink("/dev", 0x7fff3b05c670, 4096) = -1 EINVAL (Invalid argument)
> 4434  readlink("/dev/sda1", 0x7fff3b05c670, 4096) = -1 EINVAL (Invalid
> argument)
> 4434  readlink("/r1", 0x7fff3b05c670, 4096) = -1 EINVAL (Invalid argument)
> 4434  getuid()  = 0
> 4434  geteuid() = 0
> 4434  getgid()  = 0
> 4434  getegid() = 0
> 4434  prctl(PR_GET_DUMPABLE)= 1
> 4434  lstat("/etc/mtab", {st_mode=S_IFLNK|0777, st_size=12, ...}) = 0
> 4434  getuid()  = 0
> 4434  geteuid() = 0
> 4434  getgid()  = 0
> 4434  getegid() = 0
> 4434  prctl(PR_GET_DUMPABLE)= 1
> 4434  stat("/run", {st_mode=S_IFDIR|0755, st_size=820, ...}) = 0
> 4434  lstat("/run/mount/utab", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> 4434  open("/run/mount/utab", O_RDWR|O_CREAT, 0644) = 3
> 4434  close(3)  = 0
> 4434  mount("/dev/sda1", "/r1", "ext4", MS_MGC_VAL|MS_NOATIME, NULL
>
> --
>
> (w/ 3.6-rc1)
>
> [   89.868843] mount   R  running task0  4434   4433
> 0x0009
> [   89.868847]  880c246b7b68 816c9279 880c246b7aa8
> 880c246b7fd8
> [   89.868851]  880c246b7fd8 4000 88062720cdb0
> 880c246862d0
> [   89.868855]  000116c0 880623a863c0 880623a863c0
> 
> [   89.868855] Call Trace:
> [   89.868858]  [] ? __schedule+0x299/0x770
> [   89.868860]  [] ? __schedule+0x299/0x770
> [   89.868864]  [] ? ext4_get_group_desc+0x49/0xb0
> [   89.868868]  [] ? ext4_calculate_overhead+0x131/0x3e0
> [   89.868871]  [] ? ext4_fill_super+0x1a4b/0x28d0
> [   89.868875]  [] ? mount_bdev+0x1a1/0x1e0
> [   89.868877]  [] ? ext4_calculate_overhead+0x3e0/0x3e0
> [   89.868880]  [] ? ext4_mount+0x10/0x20
> [   89.868882]  [] ? mount_fs+0x1b/0xd0
> [   89.868885]  [] ? vfs_kern_mount+0x6f/0x110
> [   89.86]  [] ? do_kern_mount+0x4f/0x100
> [   89.868890]  [] ? do_mount+0x2fe/0x8a0
> [   89.868894]  [] ? strndup_user+0x53/0x70
> [   89.868896]  [] ? sys_mount+0x90/0xe0
> [   89.868899]  [] ? tracesys+0xd4/0xd9
>
> Justin.
>
>
>

CC: linux-ext4

Any ideas here? Kernel 3.4 and below can mount the 60TB ext4 filesystem with
no issues, but 3.5.1 and 3.6-rc1 (I did not try 3.5) cannot mount it.

Justin.



3.4->(3.5.1 || 3.6-rc1) => can no longer mount 60TB ext4 filesystem

2012-08-11 Thread Justin Piszcz
Hello,

I upgrade to each new kernel release and with 3.5.1 (from 3.4) I can no
longer mount my 60TB ext4 volume.
If I boot back to 3.4, it works fine.

Details here:
https://lkml.org/lkml/2012/8/10/205

Anything I can do besides testing each 3.5-rcX to find where the regression
lies?

Justin.




Re: Upgraded from 3.4 to 3.5.1 kernel: machine does not boot

2012-08-12 Thread Justin Piszcz
On Sun, Aug 12, 2012 at 9:10 AM, Eric Sandeen  wrote:
> On 8/10/12 11:14 PM, Justin Piszcz wrote:
>> On Fri, Aug 10, 2012 at 7:07 PM, Justin Piszcz
>>>
>>> Hi,
>>>
>>> Found the root cause, the 3.5.1 kernel cannot mount my ext4 filesystem
>>> (60TB).
>
> You are a brave man running ext4 at 60T, but thank you for testing :)
>
> Backing out 8aeb00ff85ad25453765dd339b408c0087db1527 from 3.5.1
> (952fc18ef9ec707ebdc16c0786ec360295e5ff15 upstream) probably helps?
>
> From a quick look, I think that essentially has a :
>
> for (i = 0; i < ngroups; i++) {
>
> for (j = 0; j < ngroups; j++) {
>
> }
> }
>
> type nested loop going on; for a filesystem this big it's going to take almost
> literally forever, if I read it right.
>
> -Eric

Hello,

It worked!! I can mount my filesystem now!
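
For a sense of scale of the nested loop Eric describes, here is a rough
back-of-the-envelope sketch. The 4 KiB block size and the default of
8 * block_size blocks per group (no bigalloc) are assumptions, not values
read from this filesystem, and ext4_group_count is a made-up helper name.
It needs a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# Rough scale estimate only. Assumed (not taken from the thread): 4 KiB
# blocks and the default of 8 * block_size blocks per group, no bigalloc.

# ext4_group_count FS_TIB [BLOCK_SIZE] -> prints the number of block groups
ext4_group_count() {
    fs_tib=$1
    block_size=${2:-4096}
    blocks_per_group=$((8 * block_size))            # one bitmap block tracks this many blocks
    group_bytes=$((block_size * blocks_per_group))  # 128 MiB with 4 KiB blocks
    fs_bytes=$((fs_tib * 1024 * 1024 * 1024 * 1024))
    # Round up so a partial last group is counted.
    echo $(( (fs_bytes + group_bytes - 1) / group_bytes ))
}

ngroups=$(ext4_group_count 60)
echo "block groups:           $ngroups"
echo "nested-loop iterations: $((ngroups * ngroups))"
```

Under those assumptions a 60 TiB filesystem has 491520 block groups, so an
ngroups x ngroups loop would run about 2.4e11 inner iterations, which fits
with the mount appearing to hang.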

I pulled down 3.5 and backed out that commit. I could not quickly find
a doc on how to do this, so I will add the steps below:

1. Clone Linux repo (3.5/stable as of this writing)
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git

2. List commits:
git log

3. Show a specific commit
git show 8aeb00ff85ad25453765dd339b408c0087db1527

4. How to revert the commit:
git revert 8aeb00ff85ad25453765dd339b408c0087db1527

# On branch master
nothing to commit (working directory clean)

5. Recompile, reboot, does it work?
# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        61T   17T   44T  28% /r1
# uname -a
Linux p34 3.5.0 #1 SMP Sun Aug 12 09:42:41 EDT 2012 x86_64 GNU/Linux

Yes!

CC: Greg to see if this can be backed out for 3.5.2?

Justin.


Re: 3.4->(3.5.1 || 3.6-rc1) => can no longer mount 60TB ext4 filesystem

2012-08-12 Thread Justin Piszcz
On Sun, Aug 12, 2012 at 9:31 AM, Daniel Mack  wrote:
> On 11.08.2012 19:36, Justin Piszcz wrote:
>> Hello,
>>
>> I upgrade to each new kernel release and with 3.5.1 (from 3.4) I can no
>> longer mount my 60TB ext4 volume.
>> If I boot back to 3.4, it works fine.
>>
>> Details here:
>> https://lkml.org/lkml/2012/8/10/205
>>
>> Anything I can do besides testing each 3.5-rcX to find where the regression
>> lies?
>
> I'm not at all familiar with ext4 internals, but as always, an easy way
> is to bisect such a problem. Any chance you can do that?
>
>   http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html
>
>
> Daniel
>

Hi,

Thanks, will save this for the future. I would have done this, but Eric
Sandeen found the offending patch; I reverted it and I can mount the
filesystem now.

The bad commit was 8aeb00ff85ad25453765dd339b408c0087db1527 (per sandeen)
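
For future reference, a bisect between a known-good and a known-bad tag
could look roughly like the sketch below. find_first_bad is a made-up helper
name, and it only applies when the good/bad check can be scripted as a
command that exits 0 for good and 1 for bad. For a boot or mount regression
like this one, each step needs a rebuild and reboot, so the manual form
shown in the comments is the more likely route.

```shell
#!/bin/sh
# find_first_bad DIR GOOD BAD TESTCMD
# Prints the first commit for which TESTCMD fails, searching between GOOD
# and BAD. (Hypothetical helper; TESTCMD must exit 0 = good, 1 = bad.)
find_first_bad() {
    dir=$1; good=$2; bad=$3; testcmd=$4
    (
        cd "$dir" || exit 1
        git bisect start "$bad" "$good" >/dev/null || exit 1
        git bisect run sh -c "$testcmd" >/dev/null
        git rev-parse refs/bisect/bad       # first bad commit found by the run
        git bisect reset >/dev/null 2>&1    # return to the original checkout
    )
}

# Manual equivalent when every step needs a rebuild and reboot:
#   git bisect start v3.5.1 v3.4     # bad first, then good
#   ...build, boot, try the mount...
#   git bisect good                  # or: git bisect bad
#   ...repeat until git names the first bad commit, then:
#   git bisect reset
```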

Justin.


Re: Upgraded from 3.4 to 3.5.1 kernel: machine does not boot

2012-08-12 Thread Justin Piszcz
On Sun, Aug 12, 2012 at 10:13 AM, Paul Gortmaker
 wrote:
> On Sun, Aug 12, 2012 at 9:51 AM, Justin Piszcz  
> wrote:
>> On Sun, Aug 12, 2012 at 9:10 AM, Eric Sandeen  wrote:
>>> On 8/10/12 11:14 PM, Justin Piszcz wrote:
>>>> On Fri, Aug 10, 2012 at 7:07 PM, Justin Piszcz
>>>>>
>>>>> Hi,
>>>>>
>>>>> Found the root cause, the 3.5.1 kernel cannot mount my ext4 filesystem
>>>>> (60TB).
>>>
>>> You are a brave man running ext4 at 60T, but thank you for testing :)
>>>
>>> Backing out 8aeb00ff85ad25453765dd339b408c0087db1527 from 3.5.1
>>> (952fc18ef9ec707ebdc16c0786ec360295e5ff15 upstream) probably helps?
>>>
>>> From a quick look, I think that essentially has a :
>>>
>>> for (i = 0; i < ngroups; i++) {
>>>
>>> for (j = 0; j < ngroups; j++) {
>>>
>>> }
>>> }
>>>
>>> type nested loop going on; for a filesystem this big it's going to take 
>>> almost
>>> literally forever, if I read it right.
>>>
>>> -Eric
>>
>> Hello,
>>
>> It worked!! I can mount my filesystem now!
>>
>> I pulled down 3.5 and backed out that commit, I could not quickly find
>> a doc to do this, so I will add how to do that below:
>>
>> 1. Clone Linux repo (3.5/stable as of this writing)
>> git clone 
>> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
>>
>> 2. List commits:
>> git log
>>
>> 3. Show a specific commit
>> git show 8aeb00ff85ad25453765dd339b408c0087db1527
>>
>> 4. How to revert the commit:
>> git revert 8aeb00ff85ad25453765dd339b408c0087db1527
>>
>> # On branch master
>> nothing to commit (working directory clean)
>
> You didn't actually revert anything here, because your clone left
> you on "master" branch, which points at 3.5 (i.e. 3.5.0).   It does
> not contain the commit which is of interest to you.
>
> -
> linux-stable$git tag --contains 8aeb00ff
> v3.5.1
> linux-stable$git branch --contains 8aeb00ff
>   linux-3.5.y
> linux-stable$
> 
>
> The master branch in linux-stable is left pointing at one of the
> most recent mainline (i.e. non-stable) tags, and all of the stable
> content is on individual branches (type "git branch" to see them).
>
> So if you do a "git checkout linux-3.5.y"  and then do the revert,
> you will actually be testing what you wanted to test.
>
> Paul.
> --

Yikes, I saw the commit details (via git show), but that must have looked
the commit up via git/inet. I assumed it was also in the 3.5 tree, but it's
not, per your check, so I've made some changes to my notes, recompiled,
rebooted .. and success!!  Woohoo!

One other item: I've never used a git-updated kernel before; I usually
just patch -p1 the mainline or pull it down directly (but git seems
nicer now that I know how to do it). Does the '+' signify that it's the
3.5.1 kernel plus changes I made to it?

p34:~# df -h | grep /r1
/dev/sda1        61T   16T   45T  26% /r1
p34:~# uname -a
Linux p34 3.5.1+ #3 SMP Sun Aug 12 10:31:34 EDT 2012 x86_64 GNU/Linux
p34:~# uptime
 10:35:12 up 1 min,  1 user,  load average: 0.05, 0.03, 0.01
p34:~#

---


Updated notes:


1. Clone Linux repo (3.5/stable as of this writing)
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-3.5

2.0 Cd into linux-3.5
cd linux-3.5

2.1 Check available kernel versions:
git tag | tail -n 3
v3.5-rc6
v3.5-rc7
v3.5.1

2.2 Check out the 3.5.1 tag:
git checkout v3.5.1

Note: checking out 'v3.5.1'.
..
HEAD is now at cbd3c20... Linux 3.5.1

2.3 Confirm it is 3.5.1:
# head -n 3 Makefile
VERSION = 3
PATCHLEVEL = 5
SUBLEVEL = 1

2.4 List commits:
git log

3. Show a specific commit
git show 8aeb00ff85ad25453765dd339b408c0087db1527

4. How to revert the commit:
git revert 8aeb00ff85ad25453765dd339b408c0087db1527
(It brings up a text editor like svn/cvs, commit -> write/save/quit)

[detached HEAD 35d699f] Revert "ext4: fix overhead calculation used by
ext4_statfs()"
 Committer: root 
Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly:

git config --global user.name "Your Name"
git config --global user.email y...@example.com

After doing this, you may fix the identity used for this commit with:

git commit --amend --reset-author

 4 files changed, 57 insertions(+), 132 deletions(-)

5. Recompile, reboot, does it work, still?
# df -h
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda1        61T   17T   44T  28% /r1
# uname -a
Linux p34 3.5.0 #1 SMP Sun Aug 12 09:42:41 EDT 2012 x86_64 GNU/Linux

Yes!


--

How to find where a particular commit lies?

-
linux-stable$git tag --contains 8aeb00ff
v3.5.1
linux-stable$git branch --contains 8aeb00ff
  linux-3.5.y
linux-stable$

--

Justin.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


3.5.1: WARNING: at include/linux/iocontext.h:140 copy_process.part.56+0x1041/0x1190()

2012-08-15 Thread Justin Piszcz
Hi,

Kernel: 3.5.1 x86_64
Just FYI, this may have been due to an NFS issue (the remote host was
possibly turned off during the dump), but reporting just in case:

[41793.725267] ------------[ cut here ]------------
[41793.725273] WARNING: at include/linux/iocontext.h:140
copy_process.part.56+0x1041/0x1190()
[41793.725274] Hardware name: X9SCL/X9SCM
[41793.725276] Pid: 20378, comm: dump Not tainted 3.5.1 #2
[41793.725276] Call Trace:
[41793.725279]  [] warn_slowpath_common+0x75/0xb0
[41793.725280]  [] warn_slowpath_null+0x15/0x20
[41793.725282]  [] copy_process.part.56+0x1041/0x1190
[41793.725283]  [] do_fork+0x13b/0x2f0
[41793.725285]  [] ? recalc_sigpending+0x12/0x30
[41793.725287]  [] ? __set_current_blocked+0x3a/0x60
[41793.725290]  [] sys_clone+0x23/0x30
[41793.725293]  [] stub_clone+0x13/0x20
[41793.725294]  [] ? system_call_fastpath+0x1a/0x1f
[41793.725295] ---[ end trace 6afd1df8f82f60e4 ]---

Justin.


Supermicro X9SRL-F - channel enumeration error & ACPI/firmware bug question

2012-11-24 Thread Justin Piszcz
Hi,

Is the following normal on an X9SRL-F board (bios 1.0a)?

In the manual it states:

Data Direct I/O
Select Enabled to enable Intel I/OAT (I/O Acceleration Technology), which
significantly reduces CPU overhead by leveraging CPU architectural
improvements and freeing the system resource for other tasks. The options
are Disabled and Enabled.

Default is Enabled.

When enabled in the kernel, I see the following:

[0.696357] ioatdma: Intel(R) QuickData Technology Driver 4.00
[0.696487] ioatdma :00:04.0: channel error register unreachable
[0.696546] ioatdma :00:04.0: channel enumeration error
[0.696604] ioatdma :00:04.0: Intel(R) I/OAT DMA Engine init failed
[0.696721] ioatdma :00:04.1: channel error register unreachable
[0.696779] ioatdma :00:04.1: channel enumeration error
[0.697522] ioatdma :00:04.1: Intel(R) I/OAT DMA Engine init failed
[0.697617] ioatdma :00:04.2: channel error register unreachable
[0.697681] ioatdma :00:04.2: channel enumeration error
[0.697739] ioatdma :00:04.2: Intel(R) I/OAT DMA Engine init failed
[0.697831] ioatdma :00:04.3: channel error register unreachable
[0.697890] ioatdma :00:04.3: channel enumeration error
[0.697948] ioatdma :00:04.3: Intel(R) I/OAT DMA Engine init failed
[0.698037] ioatdma :00:04.4: channel error register unreachable
[0.698095] ioatdma :00:04.4: channel enumeration error
[0.698153] ioatdma :00:04.4: Intel(R) I/OAT DMA Engine init failed
[0.698245] ioatdma :00:04.5: channel error register unreachable
[0.698303] ioatdma :00:04.5: channel enumeration error
[0.698360] ioatdma :00:04.5: Intel(R) I/OAT DMA Engine init failed
[0.698449] ioatdma :00:04.6: channel error register unreachable
[0.698508] ioatdma :00:04.6: channel enumeration error
[0.698565] ioatdma :00:04.6: Intel(R) I/OAT DMA Engine init failed
[0.698676] ioatdma :00:04.7: channel error register unreachable
[0.698735] ioatdma :00:04.7: channel enumeration error
[0.698792] ioatdma :00:04.7: Intel(R) I/OAT DMA Engine init failed

--

Also, I tried using ASPM (enabled in BIOS), but since ACPI Linux query is
ignored, it fails to work:
[0.562229] [Firmware Bug]: ACPI: BIOS _OSI(Linux) query ignored

I assume this is something Supermicro has to fix?

Justin.



3.6.10: Intel: ixgbe 0000:01:00.0 eth4: Detected Tx Unit Hang

2012-12-15 Thread Justin Piszcz
Hello,

Kernel 3.6.10, first time I have seen this that I can remember (on 10GbE)
anyway, is this a known issue with 3.6.10?

When the link went down is when I rebooted/etc the remote host attached on
the other end.
I've not changed anything physically with the hardware and have been on
3.6.0-3.6.9 and noticed this when I moved to 3.6.10.

[10270.229200] ixgbe 0000:01:00.0 eth4: NIC Link is Down
[10276.124937] ixgbe 0000:01:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[24529.430997] ixgbe 0000:01:00.0 eth4: Detected Tx Unit Hang
[24529.430997]   Tx Queue             <10>
[24529.430997]   TDH, TDT             <4e>, <51>
[24529.430997]   next_to_use          <51>
[24529.430997]   next_to_clean        <4e>
[24529.430997] tx_buffer_info[next_to_clean]
[24529.430997]   time_stamp           <10172668f>
[24529.430997]   jiffies              <101726ea4>
[24529.431011] ixgbe 0000:01:00.0 eth4: tx hang 1 detected on queue 10, resetting adapter
[24529.431028] ixgbe 0000:01:00.0 eth4: Reset adapter
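
For reference, the hex fields in the hang dump above can be read with a little arithmetic. The following is only a sketch: the function names are made up for illustration, the ring size is an assumed value (the log does not show it), and the conversion to seconds depends on CONFIG_HZ, which is also not stated here.

```python
# Sketch: interpret the hex fields of the "Detected Tx Unit Hang" dump.
# Function names are hypothetical; ring_size and HZ are assumptions.

def pending_descriptors(tdh, tdt, ring_size=512):
    """Descriptors between head (TDH) and tail (TDT) on a circular ring."""
    return (tdt - tdh) % ring_size

def stalled_jiffies(time_stamp, jiffies):
    """Jiffies elapsed since the buffer at next_to_clean was time-stamped."""
    return jiffies - time_stamp

queued = pending_descriptors(0x4e, 0x51)
age = stalled_jiffies(0x10172668f, 0x101726ea4)

print("descriptors still pending:", queued)
print("age of oldest pending buffer in jiffies:", age)
for hz in (100, 250, 1000):  # common CONFIG_HZ choices, actual value unknown
    print("  at HZ=%d that is %.2f s" % (hz, age / hz))
```

So only a handful of descriptors were outstanding, which fits the report that the link was essentially idle when the hang was flagged.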

Thoughts?

lspci -vvxx

01:00.0 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AT2 Server
Adapter (rev 01)
  Subsystem: Intel Corporation 82598EB 10-Gigabit AT2 Server Adapter
  Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-


RE: 3.7: Intel: ixgbe 0000:01:00.0 eth4: Detected Tx Unit Hang

2012-12-15 Thread Justin Piszcz


-Original Message-
From: Justin Piszcz [mailto:jpis...@lucidpixels.com] 
Sent: Saturday, December 15, 2012 10:49 AM
To: linux-kernel@vger.kernel.org
Subject: 3.6.10: Intel: ixgbe 0000:01:00.0 eth4: Detected Tx Unit Hang

Hello,

CORRECTION: Kernel 3.7





RE: 3.6.10: Intel: ixgbe 0000:01:00.0 eth4: Detected Tx Unit Hang

2012-12-17 Thread Justin Piszcz


-Original Message-
From: devendra.aaru [mailto:devendra.a...@gmail.com] 
Sent: Monday, December 17, 2012 1:39 PM
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org; net...@vger.kernel.org
Subject: Re: 3.6.10: Intel: ixgbe 0000:01:00.0 eth4: Detected Tx Unit Hang

Ccing netdev
On Sat, Dec 15, 2012 at 10:49 AM, Justin Piszcz 
wrote:
> Hello,
>
> Kernel 3.6.10, first time I have seen this that I can remember (on 10GbE)
> anyway, is this a known issue with 3.6.10?
>
> When the link went down is when I rebooted/etc the remote host attached on
> the other end.
> I've not changed anything physically with the hardware and have been on
> 3.6.0-3.6.9 and noticed this when I moved to 3.6.10.

--

> I don't believe we have seen Tx hangs in validation. If you could narrow
> down the conditions that lead to the Tx hang that would help a lot. Also
> the output of ethtool -S eth4 after the Tx hang occurs can be useful to
> get an idea of the load on the interface.
>
> Thanks,
> Emil

--

In this case I only have two servers, which mount each other's NFS volumes
and were idle at the time. I rebooted one of the systems and that is when I
saw this. If I can find a repeatable pattern and/or capture the ethtool
output, I will update this thread. Thank you.

Justin.



X9SCM-F-O clock drift +1 second into the future when ntp running?

2012-07-07 Thread Justin Piszcz
Hello,

I migrated from an X7SPA to an X9SCM-F-O and now gpsd/ntp no longer sync
with my GPS unit:
http://www.amazon.com/GlobalSat-BU-353-USB-GPS-Receiver/dp/B000PKX2KA

I did some digging and it looks like the system clock on this motherboard
with the latest BIOS (2.00a) runs 1 second too fast when comparing to other
NTP-synchronized machines.

When comparing the clock on this vs. an atomic clock, the system clock is ~1
second faster, which is probably why the GPS has problems syncing.

Is this a faulty motherboard clock or is this an issue with Ivy Bridge (I am
using an E3-1200 V2 CPU) with the X9SCM-F-O and BIOS 2.00a?

DMI Info:

Handle 0x, DMI type 0, 24 bytes
BIOS Information
Vendor: American Megatrends Inc.
Version: 2.0a
Release Date: 06/08/2012
Address: 0xF
Runtime Size: 64 kB
ROM Size: 8192 kB

NTP problem:

The problem is the 'x' (falseticker) flag on the GPS source, 127.127.28.0:

$ ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
x127.127.28.0    .GPS.            0 l   11   16  377    0.000    0.363 100.414
*204.235.61.9    128.174.38.133   2 u   48   64   37   48.716  -985.27 326.767
+184.105.192.247 216.218.254.202  2 u   43   64   37   90.902  -987.55 332.766
+50.7.247.114    85.114.26.194    2 u   42   64   37  158.445  -985.19 330.627
+69.65.40.29     209.51.161.238   2 u   43   64   37   47.733  -984.50 329.232

Any idea why it consistently has a ~1 second offset?
Is there a good way to fix this?

Before, on the X7SPA I ran gpsd+ntp for years without any issues, it
synchronized perfectly.  Is this a BIOS issue?  Kernel problem or HW issue?

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

Full output:

# adjtimex -p
 mode: 0
   offset: 0
frequency: 1523449
 maxerror: 1600
 esterror: 1600
   status: 8257
time_constant: 3
precision: 1
tolerance: 32768000
 tick: 10015
 raw time:  1341673069s 873569969us = 1341673069.873569969
return value = 5
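
To put the tick and frequency values above into the same unit as ntp's limits, they can be combined into one rate offset in ppm. This is a rough sketch, assuming USER_HZ is 100 (so the nominal tick is 10000 microseconds) and that frequency is scaled so that 65536 equals 1 ppm, as the adjtimex man page describes; the function name is made up.

```python
# Sketch: combine adjtimex "tick" and "frequency" into one offset in ppm.
# Assumption: USER_HZ = 100, so the nominal tick is 10000 us.
NOMINAL_TICK_US = 10000
FREQ_UNITS_PER_PPM = 65536  # adjtimex frequency scale: 65536 == 1 ppm

def rate_offset_ppm(tick, frequency):
    """How far the configured clock rate is from nominal, in ppm."""
    tick_ppm = (tick - NOMINAL_TICK_US) * 1000000.0 / NOMINAL_TICK_US
    return tick_ppm + frequency / float(FREQ_UNITS_PER_PPM)

# Values from the adjtimex -p output above.
print("%.1f ppm" % rate_offset_ppm(10015, 1523449))
```

Each step of tick is worth 100 ppm under this assumption, so tick=10015 alone already puts the correction far above a few hundred ppm.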

After 5-10 minutes without ntpd running, the clock had drifted -2.27 seconds
the other direction..

# ntpdate time.nist.gov
 7 Jul 11:03:32 ntpdate[374]: step time server 128.138.140.44 offset
-2.273077 sec

Justin.



RE: X9SCM-F-O clock drift +1 second into the future when ntp running?

2012-07-07 Thread Justin Piszcz
-Original Message-
From: Justin Piszcz [mailto:jpis...@lucidpixels.com] 
Sent: Saturday, July 07, 2012 11:12 AM
To: p...@lists.ntp.org; linux-kernel@vger.kernel.org
Subject: X9SCM-F-O clock drift +1 second into the future when ntp running?

Hello,

I migrated from an X7SPA to an X9SCM-F-O and now gpsd/ntp no longer sync
with my GPS unit:
http://www.amazon.com/GlobalSat-BU-353-USB-GPS-Receiver/dp/B000PKX2KA

I did some digging and it looks like the system clock on this motherboard
with the latest BIOS (2.00a) runs 1 second too fast when comparing to other
NTP-synchronized machines.

--

I've done some more reading, and what I've found is that ntp cannot correct
for anything over 500ppm. So would this be a bad crystal/chip, or is there
something else wrong here?

Supermicro X9SCM (note the error_ppm, which is all over the map):

# adjtimex -a
                                     --- current ---   -- suggested --
cmos time     system-cmos  error_ppm   tick      freq    tick      freq
1341688282      -0.731743
1341688292      -0.731619       12.4   9993   3497075
1341688302      -0.747186    -1556.7   9993   3497075   10009    659788
1341688312      -0.747117        6.9   9993   3497075    9993   3045513
1341688322      -0.747099        1.8   9993   3497075    9993   3378325
1341688332      -0.762636    -1553.7   9993   3378325   10009    344163
1341688342      -0.762650       -1.4   9993   3378325    9993   3470512
1341688352      -0.778221    -1557.1   9993   3378325   10009    566038

# adjtimex -a
                                     --- current ---   -- suggested --
cmos time     system-cmos  error_ppm   tick      freq    tick      freq
1341688564      -0.591964
1341688574      -0.575373     1659.1  10009    566038
1341688584      -0.574515       85.8  10009    566038   10008   1497763
1341688594      -0.558007     1650.8  10009    566038    9992   3789738
1341688604      -0.537084     2092.3  10009    566038    9988   1071325
1341688614      -0.568492    -3140.8   9988   1071325   10019   3744100
1341688624      -0.568774      -28.2   9988   1071325    9988   2919762
1341688634      -0.584618    -1584.4   9988   1071325   10004     49663
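
The error_ppm column appears to be nothing more than the change in the system-cmos difference between two samples, divided by the 10 second sample interval. A small sketch to reproduce it and compare against the 500 ppm figure mentioned above (the function name and the constant name are made up for illustration):

```python
# Sketch: recompute error_ppm from successive system-cmos readings.
# The 10 s interval comes from the "cmos time" column; names are hypothetical.
NTP_MAX_CORRECTION_PPM = 500.0

def error_ppm(prev_diff_s, curr_diff_s, interval_s=10.0):
    """Apparent clock rate error between two samples, in ppm."""
    return (curr_diff_s - prev_diff_s) / interval_s * 1e6

# First four system-cmos values from the first X9SCM run above.
samples = [-0.731743, -0.731619, -0.747186, -0.747117]
errors = [error_ppm(a, b) for a, b in zip(samples, samples[1:])]

for e in errors:
    verdict = "beyond 500 ppm" if abs(e) > NTP_MAX_CORRECTION_PPM else "ok"
    print("%9.1f ppm  %s" % (e, verdict))
```

Seen this way, the rate is not steadily off; it alternates between small errors and jumps of roughly 15 ms per 10 second interval.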

Old MSI motherboard (Pentium 4-- see how the PPM is roughly in the same
range)

# adjtimex -a
WARNING: CMOS time is 30.02 min ahead of system clock
                                     --- current ---   -- suggested --
cmos time     system-cmos  error_ppm   tick      freq    tick      freq
1341690163   -1800.416735
1341690173   -1800.417111      -37.6   9996   2365500
1341690183   -1800.417499      -38.7   9996   2365500    9996   4904456
1341690193   -1800.417894      -39.5   9996   2365500    9996   4956019
1341690203   -1800.418297      -40.3   9996   2365500    9996   5007582
1341690213   -1800.418308       -1.0   9996   5007582    9996   5074664
1341690223   -1800.418327       -1.9   9996   5007582    9996   5134039
1341690233   -1800.418353       -2.6   9996   5007582    9996   5179351

Found this:
http://compgroups.net/comp.protocols.time.ntp/hopelessly-broken-clock/490495

set clocksource=hpet for boot options instead of TSC

Did not make any difference, still all over the map:

# adjtimex -a
                                     --- current ---   -- suggested --
cmos time     system-cmos  error_ppm   tick      freq    tick      freq
1341690087       0.238485
1341690097       0.240449      196.4  10015   1475154
1341690107       0.257408     1695.9  10015   1475154    9998   1744166
1341690117       0.274176     1676.8  10015   1475154    9998   2995729
1341690127       0.290927     1675.1  10015   1475154    9998       310
1341690137       0.291142       21.5   9998       310    9998   1697291
1341690147       0.275783    -1535.9   9998       310   10013   5458916
1341690157       0.276029       24.6   9998       310    9998   1494166

Has anyone seen anything like this before?
I checked all of the BIOS options, did not see anything out of the ordinary
here..

Justin.



Kernel 2.6.xx - NFSv3 vs. Samba Data Transfer Semantics

2005-08-06 Thread Justin Piszcz
I have three machines with the same motherboard and gigabit ethernet, ABIT 
IC7-G.


Two are Linux (Debian)
One is Windows 2000.

When I copy 100 gigabytes from a Windows 2000 PC to either one of my Linux 
machines, I get a *SUSTAINED* transfer rate of 40-50MB/s over gigabit. 
Sustained meaning, when I watch gkrellm, eth0 never dips below 40MB/s.


When I copy 100 gigabytes from one Linux box to the other over NFS, I see 
all sorts of weirdness, 64MB/s for a few seconds, then 40MB/s, then 
10-30MB/s, then 0MB/s for 2-3 seconds then 7MB/s, it goes all over the 
place.  I have tried different (r|w)sizes without any conclusive results, 
they do not seem to make much of a difference.


A few examples, copy an ~800MB file to a Linux box:

TCP/FTP:

226 65.484 seconds (measured here), 11.98 Mbytes per second
822514728 bytes received in 65.48 secs (12266.2 kB/s)

UDP/NFSv3:

0.15user 12.00system 0:26.22elapsed 46%CPU (0avgtext+0avgdata 
0maxresident)k0inputs+0outputs (0major+148minor)pagefaults 0swaps


0.14user 13.96system 0:28.31elapsed 49%CPU (0avgtext+0avgdata 
0maxresident)k0inputs+0outputs (0major+148minor)pagefaults 0swaps


TCP/Samba, Win2K->Linux box:

$ date +%s
1123338368
$ date +%s
1123338399
1123338399 - 1123338368
31 seconds
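
To compare the three runs on one scale, the elapsed times can be turned into MB/s. This is just a sketch: it assumes every run moved the same 822514728-byte file reported by the FTP client (only "~800MB" is stated for the others), and the function name is made up.

```python
# Sketch: convert the elapsed times above into throughput (decimal MB/s).
# Assumption: all runs copied the same file; size taken from the FTP output.
FILE_BYTES = 822514728

def mb_per_s(n_bytes, seconds):
    return n_bytes / float(seconds) / 1e6

runs = [
    ("FTP", 65.484),
    ("NFSv3 run 1", 26.22),
    ("NFSv3 run 2", 28.31),
    ("Samba", 31.0),
]
for name, secs in runs:
    print("%-12s %6.2f MB/s" % (name, mb_per_s(FILE_BYTES, secs)))
```

On these numbers NFS finishes slightly sooner than Samba on average, so the difference described here is about how steady the rate is, not the total time.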

I suppose NFS makes up for it bursting at such high speeds, but in some 
cases, a constant data rate is preferred.  Are there any methods to 
duplicate the way Samba works to NFS?  When NFS transfers are taking 
place, watching gkrellm, I see 64MB/s for a few seconds then it goes to 0 
as the disk (hda) continues to write for 3-4 seconds, this continues on 
and off.  With Samba and the W2K box pushing the data, it is more of a 
consistent stream with very few delays that are found with NFS.


I am using the Intel e1000 driver for gigabit:
:02:01.0 Ethernet controller: Intel Corp. 82547GI Gigabit Ethernet
Controller.

I *do* have the NAPI option enabled for the driver on both Linux machines.
[*]   Use Rx Polling (NAPI)

Samba Config:
# Increase overall throughput of samba.
socket options = IPTOS_LOWDELAY TCP_NODELAY SO_SNDBUF=32768 SO_RCVBUF=8192
# Set max xmit size.
max xmit = 8192

NFS Config/fstab Entry:
machine:/mount  /local/mount  nfs rw,hard,intr,nfsvers=3 0 0

I am using XFS filesystems on both Linux machines.  The drives are 7200RPM 
Seagate HDDs with either 2MB or 8MB of cache.


Are there any 'tweaks' or 'hacks' to make NFS behave more like Samba or 
just to tune it in general that are not commonly known or found on google?


Thanks,

Justin.


Question regarding HPET in the 2.6 series kernel.

2005-08-14 Thread Justin Piszcz


[*] HPET Timer Support
[*]   Provide RTC interrupt

[*] HPET - High Precision Event Timer
[*]   Allow mmap of HPET

http://tlug.up.ac.za/guides/lkcg/arch_i386.html

HPET Timer Support  HPET_TIMER
This enables the use of the HPET for the kernel's internal timer. HPET is 
the next generation timer replacing legacy 8254s. You can safely choose Y 
here. However, HPET will only be activated if the platform and the BIOS 
support this feature. Otherwise the 8254 will be used for timing services. 
Choose N to continue using the legacy 8254 timer.


How do I determine if my BIOS has this feature?

$ dmesg | grep -i hpet
$ dmesg | grep -i 8254
$ dmesg | grep -i timer
..TIMER: vector=0x31 pin1=2 pin2=-1
PCI: Setting latency timer of device :02:01.0 to 64
$

Assuming it does, is there any reason to use or not to use this feature?

Thanks,

Justin.



Kernel 2.6: P4 SMT Question.

2005-07-05 Thread Justin Piszcz

General Question:

Would a desktop or server benefit more from SMT?

For a Pentium 4 w/HT, we use SMP.

What are the advantages/disadvantages of using SMT in the kernel?

Thanks,

Justin.



XFS Oops Under 2.6.12.2

2005-07-09 Thread Justin Piszcz
After a couple hours of use, I get this error on a linear RAID under 
2.6.12.2 using loop-AES w/AES-256 encrypted filesystem.


Anyone know what is wrong?

Filesystem "loop1": XFS internal error xfs_da_do_buf(2) at line 2271 of 
file fs/xfs/xfs_da_btree.c.  Caller 0xc025e807

 [] xfs_da_do_buf+0x500/0x860
 [] xfs_da_read_buf+0x57/0x60
 [] xfs_da_read_buf+0x57/0x60
 [] __tcp_data_snd_check+0xcb/0xe0
 [] tcp_new_space+0x8d/0xa0
 [] tcp_v4_rcv+0x585/0x810
 [] xfs_da_read_buf+0x57/0x60
 [] xfs_dir2_block_getdents+0xa4/0x330
 [] xfs_dir2_block_getdents+0xa4/0x330
 [] ip_local_deliver_finish+0x0/0x150
 [] ip_rcv+0x391/0x510
 [] xfs_bmap_last_offset+0xc2/0x120
 [] ip_rcv_finish+0x0/0x290
 [] xfs_dir2_put_dirent64_direct+0x0/0xc0
 [] xfs_dir2_isblock+0x32/0x90
 [] xfs_dir2_put_dirent64_direct+0x0/0xc0
 [] xfs_dir2_getdents+0xa1/0x150
 [] xfs_dir2_put_dirent64_direct+0x0/0xc0
 [] xfs_readdir+0x75/0xc0
 [] linvfs_readdir+0x10e/0x270
 [] net_rx_action+0x6a/0xf0
 [] vfs_readdir+0x77/0x90
 [] filldir64+0x0/0x110
 [] sys_getdents64+0x6f/0xb2
 [] filldir64+0x0/0x110
 [] syscall_call+0x7/0xb

# lsmod
Module  Size  Used by

I am not using any special modules.
Configuration file attached.



config-2.6.12.2.txt.bz2
Description: Binary data


Kernel/Box Freezes Under Kernel 2.6.12.5

2005-08-26 Thread Justin Piszcz

Kernel 2.6.12.5:
1- 400GB Seagate 8MB cache, 7200RPM, ATA/100 drive.
2- ATA/133 Maxtor (ATA/Promise Controller)

1) Attached the Seagate 400GB drive to the ATA/133 controller.
2) (Not mounted yet)
3) See below

hde: 781422768 sectors (400088 MB) w/8192KiB Cache, CHS=48641/255/63, 
UDMA(100)


4) Partition with fdisk (hde1).
5) mkfs.xfs /dev/hde1

*** KERNEL FREEZE *** (ENTIRE MACHINE LOCKS UP)

Do the SAME EXACT THING on the motherboard (INTEL) controller or an 
ATA/100 Promise Controller, there are NO problems.


Either people with this problem are *not* reporting it or do not know 
where to report the problem to.


This is the second machine I have seen this problem with.

Has anyone looked into this?

Justin.



Re: Kernel/Box Freezes Under Kernel 2.6.12.5

2005-08-26 Thread Justin Piszcz
I have three different Maxtor (promise) ATA/133 controllers, it happens 
with all three.



On Fri, 26 Aug 2005, Patrick McFarland wrote:

> On Friday 26 August 2005 05:36 pm, Justin Piszcz wrote:
>> 2- ATA/133 Maxtor (ATA/Promise Controller)
>
> Make sure it's actually the kernel and not that controller. Go find another
> identical one and test with it.
>
> --
> Patrick "Diablo-D3" McFarland || [EMAIL PROTECTED]
> "Computer games don't affect kids; I mean if Pac-Man affected us as kids, we'd
> all be running around in darkened rooms, munching magic pills and listening to
> repetitive electronic music." -- Kristian Wilson, Nintendo, Inc, 1989




Re: Promise ATA/133 Errors With 2.6.10+

2005-08-26 Thread Justin Piszcz

It appears that 2.6.13-rc7 has fixed the bug.
I would like to know *What* changed, but I'll probably never find out :(

On Thu, 28 Jul 2005, Andrew Morton wrote:

> Justin Piszcz <[EMAIL PROTECTED]> wrote:
>> I have two different machines with the 7200.8 Seagate 8MB 400GB drives.
>>
>> Both have ATA/133 controllers, the error is the same on both:
>>
>> Jun 24 15:24:18 localhost kernel: hde: no DRQ after issuing MULTWRITE_EXT
>>
>> I put the drive on an (older) Promise ATA/100 controller = works great!
>> I put the drive on the second box on the motherboard IDE interface = works
>> great!
>>
>> What happened > 2.6.10 to the promise driver?
>>
>> ??
>>
>> Jun 24 15:24:18 localhost kernel: PDC202XX: Primary channel reset.
>> Jun 24 15:24:18 localhost kernel: hde: timeout waiting for DMA
>> Jun 24 15:24:18 localhost kernel: hde: status error: status=0x58 { DriveReady SeekComplete DataRequest }
>> Jun 24 15:24:18 localhost kernel:
>> Jun 24 15:24:18 localhost kernel: ide: failed opcode was: unknown
>> Jun 24 15:24:18 localhost kernel: hde: drive not ready for command
>> Jun 24 15:24:18 localhost kernel: hde: status timeout: status=0xd0 { Busy }
>> Jun 24 15:24:18 localhost kernel:
>> Jun 24 15:24:18 localhost kernel: ide: failed opcode was: unknown
>> Jun 24 15:24:18 localhost kernel: PDC202XX: Primary channel reset.
>> Jun 24 15:24:18 localhost kernel: hde: no DRQ after issuing MULTWRITE_EXT
>> Jun 24 15:24:18 localhost kernel: ide2: reset: success
>
> Is this still happening in 2.6.13-rc4?
>
> If so, can you please cc linux-kernel on the reply?  Thanks.




Kernel 2.6.13-rc7 Latency Question

2005-08-27 Thread Justin Piszcz


These options are self-explanatory:

( ) No Forced Preemption (Server)
( ) Voluntary Kernel Preemption (Desktop)
(X) Preemptible Kernel (Low-Latency Desktop)

It says 100 HZ or 250 HZ is good for SMP systems; however, what if I am
using a P4 machine with 1 CPU (HT)? Is 1000 HZ still the way to go, as it's
not really 2 *REAL* CPUs?

( ) 100 HZ
( ) 250 HZ
(X) 1000 HZ
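
For a sense of scale, each timer frequency choice corresponds to a tick length of 1/HZ seconds. A trivial sketch (the function name is made up):

```python
# Sketch: timer tick length for each CONFIG_HZ choice listed above.
def tick_period_ms(hz):
    return 1000.0 / hz

for hz in (100, 250, 1000):
    print("%4d HZ -> %4.1f ms per tick" % (hz, tick_period_ms(hz)))
```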



Re: Kernel/Box Freezes Under Kernel 2.6.12.5

2005-08-28 Thread Justin Piszcz

Yes, I have two separate machines with the same controller and HDD.
As soon as I found out it fixed the bug on one of them, I changed it on 
the other, neither machine has crashed since.


On Fri, 26 Aug 2005, Patrick McFarland wrote:

> On Friday 26 August 2005 05:36 pm, Justin Piszcz wrote:
>> 2- ATA/133 Maxtor (ATA/Promise Controller)
>
> Make sure it's actually the kernel and not that controller. Go find another
> identical one and test with it.
>
> --
> Patrick "Diablo-D3" McFarland || [EMAIL PROTECTED]
> "Computer games don't affect kids; I mean if Pac-Man affected us as kids, we'd
> all be running around in darkened rooms, munching magic pills and listening to
> repetitive electronic music." -- Kristian Wilson, Nintendo, Inc, 1989




Linux Kernel 2.6.13-rc7 (WORKS) (2.6.13, DRQ/System CRASH)

2005-08-31 Thread Justin Piszcz

All,

I am trying to get everyone together on this to hopefully solve a serious 
bug that I have seen on multiple machines with:


a) A Promise ATA/133 controller (ATA/100 works OK)
b) Kernel 2.6.12 or 2.6.13 (2.6.13-rc7 appears to be OK)

The drive is a Seagate 7200.8 400GB 7200RPM 8MB cache disk.
hde: ST3400832A, ATA DISK drive

With older kernels, if I *DO NOT ENABLE DMA* it does not crash.
If I *ENABLE DMA* then proceed to do anything with the disk, it will 
FREEZE the box, no oops, etc, *FREEZE*.


hdparm -t /dev/hde
mkfs.xfs -f /dev/hde1

Will freeze the box.

---

Linux Kernel 2.6.13 final experiences the same problems as 2.6.12.5.

I have e-mailed the list quite a few times with this issue, I am surprised 
very few people run into it.


Here is the error in the logs:

Aug 31 11:30:25 p34 kernel: hde: dma_timer_expiry: dma status == 0x20
Aug 31 11:30:25 p34 kernel: hde: DMA timeout retry
Aug 31 11:30:25 p34 kernel: PDC202XX: Primary channel reset.
Aug 31 11:30:25 p34 kernel: hde: timeout waiting for DMA
Aug 31 11:30:25 p34 kernel: hde: status error: status=0x58 { DriveReady
SeekComplete DataRequest }
Aug 31 11:30:25 p34 kernel: hde: drive not ready for command
Aug 31 11:30:25 p34 kernel: hde: status timeout: status=0xd0 { Busy }
Aug 31 11:30:25 p34 kernel: PDC202XX: Primary channel reset.
Aug 31 11:30:25 p34 kernel: hde: no DRQ after issuing MULTWRITE_EXT
Aug 31 11:30:25 p34 kernel: ide2: reset: success

After this, the machine locks up with 2.6.13.

With 2.6.13-rc7, I have not seen this once.
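
The status bytes in the log above follow the standard ATA status register layout, which explains the names printed in braces. A small decoding sketch; the bit names are spelled the way the log prints them, the function name is made up, and reporting only "Busy" when the busy bit is set is an assumption that mirrors what the log shows for 0xd0.

```python
# Sketch: decode an IDE/ATA status byte into the names seen in the log.
# Assumption: when Busy (0x80) is set, the other bits are not reported.
BUSY = 0x80
STATUS_BITS = [
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

def decode_status(status):
    if status & BUSY:
        return ["Busy"]
    return [name for bit, name in STATUS_BITS if status & bit]

for value in (0x58, 0xd0):
    print("status=0x%02x { %s }" % (value, " ".join(decode_status(value))))
```

Read this way, 0x58 says the drive is ready and asking for data (DRQ set) while the DMA transfer never completes, and 0xd0 says the drive was still busy when the driver timed out.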

Can anyone offer any insight to why this is happening? I have a few 
machines with the ATA/133 controller and 400GB drives; therefore, I'd 
prefer to fix the problem rather than hooking up older, ATA/100 drives, 
just so I can run newer kernels...


Thanks.




Re: Linux Kernel 2.6.13-rc7 (WORKS) (2.6.13, DRQ/System CRASH)

2005-08-31 Thread Justin Piszcz
I do not even have IDE Taskfile Access enabled, so how is the kernel 
printing these error messages before it freezes?


linux-2.6.13/drivers/ide/ide-taskfile.c:printk(KERN_ERR "%s: no DRQ after issuing %sWRITE%s\n",


ATA/ATAPI/MFM/RLL support
  [ ] IDE Taskfile Access


Anyone have any suggestions how I can solve this problem?





Kernel 2.6.13 + IDE + MULTWRITE_EXT / DRQ Errors

2005-09-02 Thread Justin Piszcz

I still get this error when the drive is on a Promise ATA/133 card.

I have the same setup in two separate machines, the results are the same 
with kernel 2.6.13, ideas?


Should I just get more ATA/100 cards and stop trying to figure out what
the bug is?  Keep in mind the Promise ATA/100 cards exhibit no such 
errors or problems as below. I have not received any responses concerning 
this bug.


According to: 
http://www.ussg.iu.edu/hypermail/linux/kernel/0310.2/0693.html


Disabling PIO multiwrite fixes this problem, how do you do that?

As far as I can tell it is not enabled.

# hdparm -vi /dev/hde

/dev/hde:
 multcount=  0 (off)
 IO_support   =  0 (default 16-bit)
 unmaskirq=  0 (off)
 using_dma=  1 (on)
 keepsettings =  0 (off)
 readonly =  0 (off)
 readahead= 256 (on)
 geometry = 48641/255/63, sectors = 781422768, start = 0

 Model=ST3400832A, FwRev=3.01, SerialNo=3NF04YQK
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=unknown, BuffSize=8192kB, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=268435455
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: device does not report version:

 * signifies the current active mode
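
A note on the numbers in this output: LBAsects=268435455 is the 28-bit LBA ceiling (2^28 - 1), while the geometry line reports 781422768 sectors. Most of the drive is therefore only reachable with 48-bit commands, which would fit with the failing command being the _EXT variant. A minimal sketch of that arithmetic, assuming 512-byte sectors; the function names are illustrative and not taken from any tool:

```python
SECTOR_BYTES = 512               # assumed sector size
LBA28_MAX_SECTORS = 2**28 - 1    # 268435455, the LBAsects value shown by hdparm


def capacity_gb(sectors):
    """Capacity in decimal gigabytes (10**9 bytes)."""
    return sectors * SECTOR_BYTES / 1e9


def needs_lba48(sectors):
    """True when the drive has more sectors than 28-bit LBA can address."""
    return sectors > LBA28_MAX_SECTORS


if __name__ == "__main__":
    total = 781422768  # "sectors =" from the geometry line
    print(f"drive: {capacity_gb(total):.1f} GB, LBA48 needed: {needs_lba48(total)}")
    print(f"LBA28 ceiling: {capacity_gb(LBA28_MAX_SECTORS):.1f} GB")
```

This only checks the addressing arithmetic; it says nothing about whether the controller or driver handles 48-bit commands correctly.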

Another person states:

http://www.red-hat.com/archives/redhat-list/2000-June/msg00289.html

Hi
I've been gone for a while but didn't see a response to this.
I solved the problem on one machine by changing the
setting in the BIOS from UDMA AUTO to DISABLED.
I had another machine that periodically gave me this message;
it had just had the hard drive and motherboard replaced by
the shop that built it, because Seagate said that the BIOS had
misidentified the drive geometry and had the heads,
sectors, etc. set wrong.
   Linda


What is the real problem here?

Kernel .config attached for 2.6.13.

Please let me know, thanks.

Sep  2 20:48:05 p34 XFS mounting filesystem hde1
Sep  2 20:48:25 p34 hde: dma_timer_expiry: dma status == 0x20
Sep  2 20:48:25 p34 hde: DMA timeout retry
Sep  2 20:48:25 p34 PDC202XX: Primary channel reset.
Sep  2 20:48:25 p34 hde: timeout waiting for DMA
Sep  2 20:48:25 p34 hde: status error: status=0x58 {
Sep  2 20:48:25 p34 DriveReady
Sep  2 20:48:25 p34 SeekComplete
Sep  2 20:48:25 p34 DataRequest
Sep  2 20:48:25 p34 }
Sep  2 20:48:25 p34 ide: failed opcode was:
Sep  2 20:48:25 p34 unknown
Sep  2 20:48:25 p34 hde: drive not ready for command
Sep  2 20:48:26 p34 hde: status timeout: status=0xd0 {
Sep  2 20:48:26 p34 Busy
Sep  2 20:48:26 p34 }
Sep  2 20:48:26 p34 ide: failed opcode was:
Sep  2 20:48:26 p34 unknown
Sep  2 20:48:26 p34 PDC202XX: Primary channel reset.
Sep  2 20:48:26 p34 hde: no DRQ after issuing MULTWRITE_EXT
Sep  2 20:48:26 p34 ide2: reset:
Sep  2 20:48:26 p34 success
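
For reference, the status bytes in this log are bit fields of the ATA status register, and the names in braces are the set bits. A minimal sketch of the decoding, assuming the standard ATA bit layout; the names for bits that do not appear in the log (DeviceFault, CorrectedError, Index, Error) are my assumption about how the driver labels them:

```python
# ATA status register bits, most significant first.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(status):
    """Return the names of the set bits in an ATA status byte.

    While Busy is set the other bits are not meaningful, so only
    "Busy" is reported, which matches the log lines above.
    """
    if status & 0x80:
        return ["Busy"]
    return [name for mask, name in STATUS_BITS if status & mask]


if __name__ == "__main__":
    for value in (0x58, 0xD0, 0x50):
        print(f"status=0x{value:02x} {{ {' '.join(decode_status(value))} }}")
```

Read this way, 0x58 means the drive is ready and still asserting a data request after the DMA timeout, and 0xd0 means it is busy when the next command is issued.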

lspci output:

:03:06.0 Unknown mass storage controller: Promise Technology, Inc. 
20269 (rev 02)
:03:07.0 Unknown mass storage controller: Promise Technology, Inc. 
20269 (rev 02)


$ cat /proc/interrupts
           CPU0       CPU1
  0:     219088          0    IO-APIC-edge  timer
  1:          9          0    IO-APIC-edge  i8042
  9:          0          0   IO-APIC-level  acpi
 14:       3201          0    IO-APIC-edge  ide0
 15:         12          0    IO-APIC-edge  ide1
 16:      13893          0   IO-APIC-level  eth0, libata, eth2
 17:        168          0   IO-APIC-level  eth1
 18:        555          0   IO-APIC-level  ide2
 19:        192          0   IO-APIC-level  ide4, ide5
NMI:          0          0
LOC:     218910     218906
ERR:          0
MIS:          0

$ gcc -v
gcc version 4.0.1 (Debian 4.0.1-2)

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 3
model name  : Intel(R) Pentium(R) 4 CPU 3.40GHz
stepping: 4
cpu MHz : 3409.857
cache size  : 1024 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 1
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 5
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni monitor 
ds_cpl cid xtpr

bogomips: 6821.59

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 15
model   : 3
model name  : Intel(R) Pentium(R) 4 CPU 3.40GHz
stepping: 4
cpu MHz : 3409.857
cache size  : 1024 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 1
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 5
wp  : yes
flags

2.6.13+netconsole captures crash

2005-09-03 Thread Justin Piszcz
On 2.6.13, I have a simple script that tars the data from the root
filesystem to a 400GB disk. When it started, I got the following errors
and then the machine locked up:


Again, this is a 400GB Seagate on ATA/133. Someone should note in the
CONFIG_OPTION that 400GB drives are NOT supported with the Promise ATA/133
controllers.


Sep  2 21:00:55 p34 hde: dma_timer_expiry: dma status == 0x20
Sep  2 21:00:55 p34 hde: DMA timeout retry
Sep  2 21:00:55 p34 PDC202XX: Primary channel reset.
Sep  2 21:00:55 p34 hde: timeout waiting for DMA
Sep  2 21:00:55 p34 hde: status error: status=0x58 {
Sep  2 21:00:55 p34 DriveReady
Sep  2 21:00:55 p34 SeekComplete
Sep  2 21:00:55 p34 DataRequest
Sep  2 21:00:55 p34 }
Sep  2 21:00:55 p34 ide: failed opcode was:
Sep  2 21:00:55 p34 unknown
Sep  2 21:00:55 p34 hde: drive not ready for command
Sep  2 21:00:55 p34 hde: status error: status=0x50 {
Sep  2 21:00:55 p34 DriveReady
Sep  2 21:00:55 p34 SeekComplete
Sep  2 21:00:55 p34 }
Sep  2 21:00:55 p34 ide: failed opcode was:
Sep  2 21:00:55 p34 unknown
Sep  2 21:00:55 p34 hde: no DRQ after issuing MULTWRITE_EXT




Kernel 2.6.13 repeated ACPI events?

2005-09-05 Thread Justin Piszcz

I have a box where I keep getting this in dmesg:

ACPI: PCI Interrupt :01:00.0[A] -> Link [LNKD] -> GSI 5 (level, low) 
-> IRQ 5
ACPI: PCI Interrupt :01:00.0[A] -> Link [LNKD] -> GSI 5 (level, low) 
-> IRQ 5
ACPI: PCI Interrupt :01:00.0[A] -> Link [LNKD] -> GSI 5 (level, low) 
-> IRQ 5
ACPI: PCI Interrupt :01:00.0[A] -> Link [LNKD] -> GSI 5 (level, low) 
-> IRQ 5
ACPI: PCI Interrupt :01:00.0[A] -> Link [LNKD] -> GSI 5 (level, low) 
-> IRQ 5
ACPI: PCI Interrupt :01:00.0[A] -> Link [LNKD] -> GSI 5 (level, low) 
-> IRQ 5
ACPI: PCI Interrupt :01:00.0[A] -> Link [LNKD] -> GSI 5 (level, low) 
-> IRQ 5
ACPI: PCI Interrupt :01:00.0[A] -> Link [LNKD] -> GSI 5 (level, low) 
-> IRQ 5


# cat /proc/interrupts
           CPU0
  0:    2691916          XT-PIC  timer
  1:        392          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  5:    1120689          XT-PIC  eth1, eth0
  9:          1          XT-PIC  acpi
 14:       3938          XT-PIC  ide0
 15:         45          XT-PIC  ide1
NMI:          0
LOC:          0
ERR:          0
MIS:          0

Anyone have any idea what could cause this?

# lspci
00:00.0 Host bridge: Intel Corp. 82815 815 Chipset Host Bridge and Memory 
Controller Hub (rev 02)

00:01.0 PCI bridge: Intel Corp. 82815 815 Chipset AGP Bridge (rev 02)
00:1e.0 PCI bridge: Intel Corp. 82801AA PCI Bridge (rev 02)
00:1f.0 ISA bridge: Intel Corp. 82801AA ISA Bridge (LPC) (rev 02)
00:1f.1 IDE interface: Intel Corp. 82801AA IDE (rev 02)
00:1f.2 USB Controller: Intel Corp. 82801AA USB (rev 02)
00:1f.3 SMBus: Intel Corp. 82801AA SMBus (rev 02)
01:00.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone]
01:04.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] 
(rev 78)
02:00.0 VGA compatible controller: nVidia Corporation NV5M64 [RIVA TNT2 
Model 64/Model 64 Pro] (rev 15)




Re: Linux Kernel 2.6.13-rc7 (WORKS) (2.6.13, DRQ/System CRASH)

2005-09-05 Thread Justin Piszcz

Also,

Part of the problem may be that I have two ATA/133 Promise cards in one 
box and only one ATA/133 in the other box.


Kernel 2.6.13 has fixed the problem with one ATA/133 card in the box.
Kernel 2.6.13 has not fixed the problem with two ATA/133 cards in the box.

FYI

Justin.


On Mon, 5 Sep 2005, Alan Cox wrote:


On Llu, 2005-09-05 at 09:26 +0200, Bartlomiej Zolnierkiewicz wrote:

After DMA timeout driver reverted back to PIO,
ide-taskfile.c also holds PIO code besides IDE Taskfile Access.



On SMP after a DMA timeout it will potentially freeze. There are some
paths in that code which lead to double lock takes and hangs, plus some
timer races.

Justin can you make a backup (I mean that seriously), then build a
kernel with spin lock debug enabled and see if you can reproduce the
problem and get a trace.

If it's the locking you'll get a trace and the kernel will continue. At
that point, because the spinlock debug continues unsafely through a
double lock after the trace, you are in the "danger zone", hence the
backup warning.

[Yes, the spin lock debug code really should warn you it's dangerous for
non-debug uses, or get patched as it is in Fedora to trace and stop.]

If it's a hardware or other problem it will still hang.

If it's an unrelated lock problem it should still get a trace.


Why you see this only on 2.6.13 and not 2.6.13-rc7 I don't know. It makes me
wonder if you have a bad drive - but then you imply going back to rc7
goes back to stable. Can you therefore also check that the .config options
of the two kernels match?

Alan




Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-06 Thread Justin Piszcz



On Thu, 6 Dec 2007, Andrew Morton wrote:


On Sat, 1 Dec 2007 06:26:08 -0500 (EST)
Justin Piszcz <[EMAIL PROTECTED]> wrote:


I am putting a new machine together and I have dual raptor raid 1 for the
root, which works just fine under all stress tests.

Then I have the WD 750 GB drives (not RE2; the desktop ones that go for
~150-160 on sale nowadays):

I ran the following:

dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde

(as it is always a very good idea to do this with any new disk)

And sometime along the way(?) (I had gone to sleep and let it run), this
occurred:

[42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401
action 0x2 frozen


Gee we're seeing a lot of these lately.


[42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
[42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb
0x0 data 512 in
[42880.680292]  res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10
(ATA bus error)
[42881.841899] ata3: soft resetting port
[42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42915.919042] ata3.00: qc timeout (cmd 0xec)
[42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
[42915.919149] ata3.00: revalidation failed (errno=-5)
[42915.919206] ata3: failed to recover some devices, retrying in 5 secs
[42920.912458] ata3: hard resetting port
[42926.411363] ata3: port is slow to respond, please be patient (Status
0x80)
[42930.943080] ata3: COMRESET failed (errno=-16)
[42930.943130] ata3: hard resetting port
[42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42931.413523] ata3.00: configured for UDMA/133
[42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
[42931.413655] ata3: EH complete
[42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors
(750156 MB)
[42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
[42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA

Usually when I have seen this sort of thing on another box I have full of
raptors, it was due to a bad raptor, and I never saw it again after I
replaced the disk that it happened on, but that was using the Intel P965
chipset.

For this board, it is a Gigabyte GA-P35-DS4 (Rev 2.0) and I have all of
the drives (2 raptors, 3 750s) connected to the Intel ICH9 southbridge.

I am going to do some further testing but does this indicate a bad drive?
Bad cable?  Bad connector?

As you can see above, /dev/sdc stopped responding for a little bit and
then the kernel reset the port.

Why is this though?  What is the likely root cause?  Should I replace the
drive?  Obviously this is not normal and cannot be good. The idea
is to put these drives in a RAID5, and if one of them is going to time out,
the array will go degraded, which makes the drive worthless in a RAID5
configuration.

Can anyone offer any insight here?


It would be interesting to try 2.6.21 or 2.6.22.



This was due to NCQ issues (disabling it fixed the problem).

Justin.


Re: 2.6.23.9: x86_64: floppy not working: p35 chipset

2007-12-07 Thread Justin Piszcz



On Fri, 7 Dec 2007, Bill Davidsen wrote:


Justin Piszcz wrote:
Trying to format a floppy (2-3 of them) on a GA-P35-DS4 2.0 with a regular 
Sony floppy on Debian x86_64 with kernel 2.6.23.9:


# fdformat /dev/fd0
Could not determine current format type: No such device
# mformat a:
mformat: Could not get geometry of device (No such device)
#

# cat /proc/interrupts |grep floppy
  6: 38 37 39 41   IO-APIC-edge  floppy

# dmesg|grep -A1 fd0
[   52.689487] Floppy drive(s): fd0 is 1.44M
[   52.704661] FDC 0 is a post-1991 82077

During the 'attempted format'





I've tried a few different floppies, and the result is the same.  The system is
64-bit only: no 32-bit emulation is enabled, and the userland is strictly
64-bit.  Has anyone else gotten their floppy drive to work under 64-bit?


Is this just a case of a DOA floppy drive or is something else wrong?

Maybe booting from a 32 bit live CD would help determine that. It certainly 
was seen at boot time. Didn't get hooked to some SCSI device name by udev, 
did it?



--
Bill Davidsen <[EMAIL PROTECTED]>
 "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot



Retried with some other floppies and later tried the original, everything 
seems to be working now, must have been a bad floppy/some transient issue.


Justin.


2.6.23.9 64-bit IRQ sharing without irqbalance with p35 vs i965?

2007-12-08 Thread Justin Piszcz
Can someone please explain why, with almost identical kernel .configs, I see
this on a p965 Intel board:


           CPU0       CPU1       CPU2       CPU3
  0: 2501669875          0          0          0   IO-APIC-edge      timer
  1:          8          0          0          0   IO-APIC-edge      i8042
  7:          0          0          0          0   IO-APIC-edge      parport0
  8:          1          0          0          0   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:   21187797          0          0          0   IO-APIC-edge      i8042
 16:   29096123          0          0          0   IO-APIC-fasteoi   sata_sil24, uhci_hcd:usb3
 17:        122          0          0          0   IO-APIC-fasteoi   libata
 18:   27177188          0          0          0   IO-APIC-fasteoi   sata_sil24, ehci_hcd:usb1, uhci_hcd:usb7
 19:   29230654          0          0          0   IO-APIC-fasteoi   sata_sil24, ohci1394, uhci_hcd:usb6
 21:   41626956          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4, eth1
 22:        203          0          0          0   IO-APIC-fasteoi   HDA Intel
 23:        108          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb5
377:  590862399          0          0          0   PCI-MSI-edge      eth0
378:   89967666          0          0          0   PCI-MSI-edge      ahci
NMI:          0          0          0          0
LOC: 2501568889 2501566748 2501569508 2501566435
ERR:          0


And this on a Gigabyte p35 chipset:

           CPU0       CPU1       CPU2       CPU3
  0:    4798363    4806242    4807103    4801203   IO-APIC-edge      timer
  1:          1          2          3          2   IO-APIC-edge      i8042
  6:          3          0          0          0   IO-APIC-edge      floppy
  8:          0          0          0          1   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:         22         32         26         33   IO-APIC-edge      i8042
 16:          0          0          0          0   IO-APIC-fasteoi   ahci, uhci_hcd:usb3
 17:      16994      17065      16955      16952   IO-APIC-fasteoi   libata, eth0
 18:          0          1          1          1   IO-APIC-fasteoi   ohci1394, ehci_hcd:usb1, uhci_hcd:usb5, uhci_hcd:usb8
 19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb7
 21:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 22:         38         39         40         38   IO-APIC-fasteoi   HDA Intel
 23:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb6
380:    6838702    6823411    6822704    6836873   PCI-MSI-edge      ahci
NMI:          0          0          0          0
LOC:   18993695   18993662   18891135   18891030
ERR:          0


Both are running Debian Lenny (testing) with almost exactly the same kernel
configuration.
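
To put a number on the difference between the two boards, the per-CPU share of one interrupt line can be computed from /proc/interrupts text. A minimal sketch, assuming the usual layout (a header row of CPU names, then one row per IRQ); the helper name is hypothetical and the sample rows are the IRQ 22 and IRQ 17 lines from the two listings in this message:

```python
def per_cpu_share(text, irq):
    """Return the fraction of interrupts for `irq` handled by each CPU."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if label.strip() == irq:
            counts = [int(x) for x in rest.split()[:ncpu]]
            total = sum(counts)
            return [c / total if total else 0.0 for c in counts]
    raise KeyError(irq)


SAMPLE_P965 = """\
           CPU0       CPU1       CPU2       CPU3
 22:        203          0          0          0   IO-APIC-fasteoi   HDA Intel
"""

SAMPLE_P35 = """\
           CPU0       CPU1       CPU2       CPU3
 17:      16994      17065      16955      16952   IO-APIC-fasteoi   libata, eth0
"""

if __name__ == "__main__":
    print(per_cpu_share(SAMPLE_P965, "22"))
    print([round(s, 3) for s in per_cpu_share(SAMPLE_P35, "17")])
```

This does not explain why the two boards behave differently; it only makes the comparison explicit: everything lands on CPU0 on the p965 box, while the p35 box spreads each line almost evenly.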


Why is this?

Justin.


Re: [PATCH] x86_32: trim memory by updating e820 v2

2008-01-21 Thread Justin Piszcz



On Mon, 21 Jan 2008, Jesse Barnes wrote:


On Sunday, January 20, 2008 10:56 pm Yinghai Lu wrote:

[PATCH] x86_32: trim memory by updating e820 v2

when mtrr is not covering all e820 table, need to trim the ram, need to
update e820

reuse some code for x86_64

here need to add early_identify_cpu for x86_32, and move mtrr_bp_init early

compiled test only, need someone test it


I like this approach too.  So as long as the E820 modification method works
(i.e. we have some testers, maybe Justin can give it a try), you can add

Signed-off-by:  Jesse Barnes <[EMAIL PROTECTED]>

or

Acked-by:  Jesse Barnes <[EMAIL PROTECTED]>

as appropriate too.

Thanks,
Jesse



Subject: Re: [PATCH] x86_32: trim memory by updating e820 v2
 ^^

I run x86_64 btw-- if there is a kernel.patch for x86_64 please let me 
know and I can test, thanks.


Justin.


Re: [PATCH] x86_32: trim memory by updating e820 v2

2008-01-21 Thread Justin Piszcz



On Mon, 21 Jan 2008, Yinghai Lu wrote:


On Monday 21 January 2008 11:14:02 am Justin Piszcz wrote:
please get x86.git

 git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
 cd linux-2.6
 #--{ x86.git instructions }-->
 # Add Linus's tree as a remote
 git remote add linus git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

 # Add Ingo's tree as a remote
 git remote add x86 git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

 # With that setup, just run the following to get any changes you
 # don't have.  It will also notice any new branches Ingo/Linus
 # add to their repo.  Look in .git/config afterwards, the format
 # to add new remotes is easy to figure out.
 git remote update
 #-
 git merge x86/master
 git merge x86/mm

and apply

[PATCH] x86_64: check if Tom2 is enabled
http://lkml.org/lkml/2008/1/21/20
[PATCH] x86_64: update e820 instead of updating end_pfn v3
http://lkml.org/lkml/2008/1/21/19
[PATCH] x86_32: trim memory by updating e820 v2
http://lkml.org/lkml/2008/1/21/18

YH



Thanks, I am all patched up and ready to test. Unfortunately, one of the disks
in my RAID 1 just died. I have already filled out the advanced replacement form,
and I will test when I receive the replacement disk.

p34:~# lilo
Fatal: Not all RAID-1 disks are active; use '-H' to install to active disks only
p34:~# lilo -H
Warning: Partial RAID-1 install on active disks only; booting is not failsafe

Warning: Faulty disk in RAID-1 array; boot with caution!!
Fatal: Unusual RAID bios device code: 0xFF
p34:~#

Don't feel like mucking up my system at the moment.

Justin.



Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?

2007-12-01 Thread Justin Piszcz

Quick question,

Set up a new machine last night with two raptor 150 disks.  Set up RAID1 as
I do everywhere else, with 0.90.03 superblocks (in order to be compatible with
LILO; if you use 1.x superblocks with LILO you can't boot), and then:


/dev/sda1+sdb1 <-> /dev/md0 <-> swap
/dev/sda2+sdb2 <-> /dev/md1 <-> /boot (ext3)
/dev/sda3+sdb3 <-> /dev/md2 <-> / (xfs)

All works fine, no issues...

Quick question though: I turned off the machine, disconnected /dev/sda
from the machine, and booted from /dev/sdb with no problems; it showed as
degraded RAID1.  I turned the machine off and re-attached the first drive.
When I booted, my first partition had either re-synced by itself or was never
degraded. Which is it?


So two questions:

1) If it rebuilt by itself, how come it only rebuilt /dev/md0?
2) If it did not rebuild, is it because the kernel knows it does not need 
to re-calculate parity etc for swap?


I had to:

mdadm /dev/md1 -a /dev/sda2
and
mdadm /dev/md2 -a /dev/sda3

To rebuild /boot and /, which worked fine.  I am just curious, though,
why it works like this; I figured it would be all or nothing.


More info:

Not using ANY initramfs/initrd images, everything is compiled into 1 
kernel image (makes things MUCH simpler and the expected device layout etc 
is always the same, unlike initrd/etc).


Justin.


Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-01 Thread Justin Piszcz
I am putting a new machine together and I have dual raptor raid 1 for the 
root, which works just fine under all stress tests.


Then I have the WD 750 GB drives (not RE2; the desktop ones that go for
~150-160 on sale nowadays):


I ran the following:

dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde

(as it is always a very good idea to do this with any new disk)

And sometime along the way(?) (I had gone to sleep and let it run), this
occurred:


[42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401 
action 0x2 frozen

[42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
[42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 
0x0 data 512 in
[42880.680292]  res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10 
(ATA bus error)

[42881.841899] ata3: soft resetting port
[42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42915.919042] ata3.00: qc timeout (cmd 0xec)
[42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
[42915.919149] ata3.00: revalidation failed (errno=-5)
[42915.919206] ata3: failed to recover some devices, retrying in 5 secs
[42920.912458] ata3: hard resetting port
[42926.411363] ata3: port is slow to respond, please be patient (Status 
0x80)

[42930.943080] ata3: COMRESET failed (errno=-16)
[42930.943130] ata3: hard resetting port
[42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42931.413523] ata3.00: configured for UDMA/133
[42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
[42931.413655] ata3: EH complete
[42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors 
(750156 MB)

[42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
[42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: 
enabled, doesn't support DPO or FUA
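
For reading the "SStatus 123 SControl 300" part of the link-up lines: both are hexadecimal SATA register values made of three 4-bit fields (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11). A minimal sketch of splitting them, assuming that standard field layout; the helper name and the description table are illustrative:

```python
def split_sata_reg(value):
    """Split an SStatus/SControl value into its DET, SPD and IPM fields."""
    return {
        "det": value & 0xF,          # device detection
        "spd": (value >> 4) & 0xF,   # negotiated (or allowed) speed
        "ipm": (value >> 8) & 0xF,   # interface power management
    }


# Meaning of the SStatus SPD field for the generations relevant here.
SPD_NAMES = {0: "no link", 1: "1.5 Gbps", 2: "3.0 Gbps"}

if __name__ == "__main__":
    sstatus = split_sata_reg(0x123)
    scontrol = split_sata_reg(0x300)
    print("SStatus :", sstatus, "->", SPD_NAMES.get(sstatus["spd"], "unknown"))
    print("SControl:", scontrol)
```

For SStatus 0x123 this gives DET=3 (device present, link established), SPD=2 (matching the "3.0 Gbps" in the log) and IPM=1 (interface active), so the link itself came back in a normal state after each reset.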


Usually when I have seen this sort of thing on another box I have full of
raptors, it was due to a bad raptor, and I never saw it again after I
replaced the disk that it happened on, but that was using the Intel P965
chipset.


For this board, it is a Gigabyte GA-P35-DS4 (Rev 2.0) and I have all of
the drives (2 raptors, 3 750s) connected to the Intel ICH9 southbridge.


I am going to do some further testing but does this indicate a bad drive? 
Bad cable?  Bad connector?


As you can see above, /dev/sdc stopped responding for a little bit and 
then the kernel reset the port.


Why is this though?  What is the likely root cause?  Should I replace the
drive?  Obviously this is not normal and cannot be good. The idea
is to put these drives in a RAID5, and if one of them is going to time out,
the array will go degraded, which makes the drive worthless in a RAID5
configuration.


Can anyone offer any insight here?

Thank you,

Justin.


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-01 Thread Justin Piszcz



On Sat, 1 Dec 2007, Jan Engelhardt wrote:



On Dec 1 2007 06:26, Justin Piszcz wrote:

I ran the following:

dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde

(as it is always a very good idea to do this with any new disk)


Why would you care about what's on the disk? fdisk, mkfs and
the day-to-day operation will overwrite it _anyway_.

(If you think the disk is not empty, you should look at it
and copy off all usable warez beforehand :-)



The purpose is that with any new disk it's good to write to all the blocks and
let the drive do all of the re-mapping before you put 'real' data on it.
Let it crap out or fail before I put my data on it.


Justin.


Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?

2007-12-01 Thread Justin Piszcz



On Sat, 1 Dec 2007, Jan Engelhardt wrote:



On Dec 1 2007 06:19, Justin Piszcz wrote:


RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if
you use 1.x superblocks with LILO you can't boot)


Says who? (Don't use LILO ;-)

I like LILO :)




, and then:

/dev/sda1+sdb1 <-> /dev/md0 <-> swap
/dev/sda2+sdb2 <-> /dev/md1 <-> /boot (ext3)
/dev/sda3+sdb3 <-> /dev/md2 <-> / (xfs)

All works fine, no issues...

Quick question though: I turned off the machine, disconnected /dev/sda
from the machine, and booted from /dev/sdb with no problems; it showed as
degraded RAID1.  I turned the machine off and re-attached the first drive.
When I booted, my first partition had either re-synced by itself or was
never degraded. Which is it?


If md0 was not touched (written to) after you disconnected sda, it also
should not be in a degraded state.


So two questions:

1) If it rebuilt by itself, how come it only rebuilt /dev/md0?


So md1/md2 was NOT rebuilt?

Correct.




2) If it did not rebuild, is it because the kernel knows it does not
   need to re-calculate parity etc for swap?


Kernel does not know what's inside an md usually. And it should not
try to be smart.

Ok.




I had to:

mdadm /dev/md1 -a /dev/sda2
and
mdadm /dev/md2 -a /dev/sda3

To rebuild /boot and /, which worked fine.  I am just curious, though,
why it works like this; I figured it would be all or nothing.


Devices are not automatically readded. Who knows, maybe you inserted a
different disk into sda which you don't want to be overwritten.

Makes sense, I just wanted to confirm that it was normal..




More info:

Not using ANY initramfs/initrd images, everything is compiled into 1
kernel image (makes things MUCH simpler and the expected device layout
etc is always the same, unlike initrd/etc).


My expected device layout is also always the same, _with_ initrd. Why?
Simply because mdadm.conf is copied to the initrd, and mdadm will
use your defined order.


That is another way as well, people seem to be divided.



Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?

2007-12-01 Thread Justin Piszcz



On Sat, 1 Dec 2007, Jan Engelhardt wrote:



On Dec 1 2007 07:12, Justin Piszcz wrote:

On Sat, 1 Dec 2007, Jan Engelhardt wrote:

On Dec 1 2007 06:19, Justin Piszcz wrote:


RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if
you use 1.x superblocks with LILO you can't boot)


Says who? (Don't use LILO ;-)


I like LILO :)


LILO cares much less about disk layout / filesystems than GRUB does,
so I would have expected LILO to cope with all sorts of superblocks.
OTOH I would suspect GRUB to only handle 0.90 and 1.0, where the MDSB
is at the end of the disk <=> the filesystem SB is at the very beginning.


So two questions:

1) If it rebuilt by itself, how come it only rebuilt /dev/md0?


So md1/md2 was NOT rebuilt?


Correct.


Well it should, after they are readded using -a.
If they still don't, then perhaps another resync is in progress.



There was nothing in progress, md0 was synced up and md1,md2 = degraded.


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-02 Thread Justin Piszcz



On Sat, 1 Dec 2007, Justin Piszcz wrote:




On Sat, 1 Dec 2007, Janek Kozicki wrote:

Justin Piszcz said: (by the date of Sat, 1 Dec 2007 07:23:41 -0500 
(EST))



dd if=/dev/zero of=/dev/sdc


The purpose is that with any new disk it's good to write to all the blocks and
let the drive do all of the re-mapping before you put 'real' data on it.
Let it crap out or fail before I put my data on it.


better use badblocks. It writes data, then reads it afterwards:
In this example the data is semi random (quicker than /dev/urandom ;)

badblocks -c 10240 -s -w -t random -v /dev/sdc

--
Janek Kozicki |
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Will give this a shot and see if I can reproduce the error, thanks.



The badblocks run did not turn up anything; however, when I built a software
raid 5 and then performed a dd:


/usr/bin/time dd if=/dev/zero of=fill_disk bs=1M

I saw this somewhere along the way:

[30189.967531] RAID5 conf printout:
[30189.967576]  --- rd:3 wd:3
[30189.967617]  disk 0, o:1, dev:sdc1
[30189.967660]  disk 1, o:1, dev:sdd1
[30189.967716]  disk 2, o:1, dev:sde1
[42332.936615] ata5.00: exception Emask 0x2 SAct 0x7000 SErr 0x0 action 
0x2 frozen
[42332.936706] ata5.00: spurious completions during NCQ issue=0x0 
SAct=0x7000 FIS=004040a1:0800
[42332.936804] ata5.00: cmd 61/08:60:6f:4d:2a/00:00:27:00:00/40 tag 12 cdb 
0x0 data 4096 out
[42332.936805]  res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 0x2 
(HSM violation)
[42332.936977] ata5.00: cmd 61/08:68:77:4d:2a/00:00:27:00:00/40 tag 13 cdb 
0x0 data 4096 out
[42332.936981]  res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 0x2 
(HSM violation)
[42332.937162] ata5.00: cmd 61/00:70:0f:49:2a/04:00:27:00:00/40 tag 14 cdb 
0x0 data 524288 out
[42332.937163]  res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 0x2 
(HSM violation)

[42333.240054] ata5: soft resetting port
[42333.494462] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42333.506592] ata5.00: configured for UDMA/133
[42333.506652] ata5: EH complete
[42333.506741] sd 4:0:0:0: [sde] 1465149168 512-byte hardware sectors 
(750156 MB)

[42333.506834] sd 4:0:0:0: [sde] Write Protect is off
[42333.506887] sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
[42333.506905] sd 4:0:0:0: [sde] Write cache: enabled, read cache: 
enabled, doesn't support DPO or FUA
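
A note on reading the exception line: SAct is a bitmask of the NCQ tags that were outstanding when the error hit, one bit per tag. A minimal sketch of expanding it, assuming 32 possible tags; the function name is illustrative:

```python
def active_tags(sact):
    """Return the NCQ tag numbers whose bits are set in an SAct mask."""
    return [bit for bit in range(32) if (sact >> bit) & 1]


if __name__ == "__main__":
    print(active_tags(0x7000))
```

For SAct 0x7000 this yields tags 12, 13 and 14, which are the three commands listed as "tag 12", "tag 13" and "tag 14" in the log, so all three queued writes were in flight when the spurious completion was reported.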


Next test, I will turn off NCQ and try to make the problem re-occur.
If anyone else has any thoughts here..?
I ran long smart tests on all 3 disks, they all ran successfully.

Perhaps these drives need to be NCQ BLACKLISTED with the P35 chipset?

Justin.


2.6.23.9: x86_64: floppy not working: p35 chipset

2007-12-06 Thread Justin Piszcz
Trying to format a floppy (2-3 of them) on a GA-P35-DS4 2.0 with a regular 
Sony floppy on Debian x86_64 with kernel 2.6.23.9:


# fdformat /dev/fd0
Could not determine current format type: No such device
# mformat a:
mformat: Could not get geometry of device (No such device)
#

# cat /proc/interrupts |grep floppy
  6: 38 37 39 41   IO-APIC-edge  floppy

# dmesg|grep -A1 fd0
[   52.689487] Floppy drive(s): fd0 is 1.44M
[   52.704661] FDC 0 is a post-1991 82077

During the 'attempted format'

[34324.175770] floppy0: probe failed...
[34324.575342] floppy0: probe failed...
[34324.974733] floppy0: probe failed...
[34325.374302] floppy0: probe failed...
[34325.773746] floppy0: probe failed...
[34326.173312] floppy0: probe failed...
[34326.572734] floppy0: probe failed...
[34326.972241] floppy0: probe failed...
[34327.371739] floppy0: probe failed...
[34327.771344] floppy0: probe failed...
[34328.170727] floppy0: probe failed...
[34328.570283] floppy0: probe failed...
[34328.969717] floppy0: probe failed...
[34329.369275] floppy0: probe failed...
[34329.768691] floppy0: probe failed...
[34330.168197] floppy0: probe failed...
[34330.168257] end_request: I/O error, dev fd0, sector 0

I've tried a few different floppies; the result is the same.  The system 
is 64-bit only: no 32-bit emulation is enabled, and the userland is 
strictly 64-bit.  Has anyone else gotten their floppy drive to work under 
64-bit?


Is this just a case of a DOA floppy drive or is something else wrong?

Justin.


Re: [PATCH] x86_32: trim memory by updating e820 v2

2008-01-25 Thread Justin Piszcz



On Fri, 25 Jan 2008, Yinghai Lu wrote:


On Jan 25, 2008 4:01 PM, Justin Piszcz <[EMAIL PROTECTED]> wrote:




...

Tried it, it worked successfully!

With the stock kernel, I previously had to boot with mem=8832M, and top
showed this:

top - 18:53:52 up 1 min,  2 users,  load average: 1.03, 0.30, 0.10
Tasks: 169 total,   1 running, 168 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.1%us,  2.6%sy,  4.5%ni, 81.3%id,  5.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8039464k total,  1288948k used,  6750516k free, 3640k buffers
Swap: 16787768k total,0k used, 16787768k free,   178528k cached

With the kernel you mentioned plus the e820 v3 patch:

top - 18:48:13 up 3 min,  6 users,  load average: 1.67, 0.68, 0.25
Tasks: 195 total,   2 running, 193 sleeping,   0 stopped,   0 zombie
Cpu(s): 18.5%us,  1.2%sy,  1.6%ni, 74.8%id,  3.9%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8037668k total,  1438732k used,  6598936k free, 6844k buffers
Swap: 16787768k total,0k used, 16787768k free,   273928k cached

No append mem= required.




thanks

any chance to try 32 bit with the HIGHMEM64G option?

YH



My distribution is set up for 64-bit only (64-bit clean); I do not have a 
32-bit userland, so I cannot help here, sorry.


Justin.


Re: [PATCH] x86_32: trim memory by updating e820 v2

2008-01-25 Thread Justin Piszcz



On Tue, 22 Jan 2008, Yinghai Lu wrote:


On Monday 21 January 2008 01:37:09 pm Justin Piszcz wrote:


On Mon, 21 Jan 2008, Yinghai Lu wrote:


On Monday 21 January 2008 11:14:02 am Justin Piszcz wrote:
please get x86.git

 git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
 cd linux-2.6
 #--{ x86.git instructions }-->
 # Add Linus's tree as a remote
 git remote add linus git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

 # Add Ingo's tree as a remote
 git remote add x86 git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

 # With that setup, just run the following to get any changes you
 # don't have.  It will also notice any new branches Ingo/Linus
 # add to their repo.  Look in .git/config afterwards, the format
 # to add new remotes is easy to figure out.
 git remote update
 #-
 git merge x86/master
 git merge x86/mm

and apply

[PATCH] x86_64: check if Tom2 is enabled
http://lkml.org/lkml/2008/1/21/20
[PATCH] x86_64: update e820 instead of updating end_pfn v3
http://lkml.org/lkml/2008/1/21/19
[PATCH] x86_32: trim memory by updating e820 v2
http://lkml.org/lkml/2008/1/21/18

YH



Thanks, I am all patched up and ready to test. Unfortunately, one of the
disks in my RAID 1 just died; I have already filled out the advanced
replacement form and will test when I receive the replacement disk.


please get x86.git and apply
[PATCH] x86_32: trim memory by updating e820 v3
http://lkml.org/lkml/2008/1/22/394

Ingo already put other two into the tree.

Thanks

YH



Tried it, it worked successfully!

With the stock kernel, I previously had to boot with mem=8832M, and top 
showed this:


top - 18:53:52 up 1 min,  2 users,  load average: 1.03, 0.30, 0.10
Tasks: 169 total,   1 running, 168 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.1%us,  2.6%sy,  4.5%ni, 81.3%id,  5.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8039464k total,  1288948k used,  6750516k free, 3640k buffers
Swap: 16787768k total,0k used, 16787768k free,   178528k cached

With the kernel you mentioned plus the e820 v3 patch:

top - 18:48:13 up 3 min,  6 users,  load average: 1.67, 0.68, 0.25
Tasks: 195 total,   2 running, 193 sleeping,   0 stopped,   0 zombie
Cpu(s): 18.5%us,  1.2%sy,  1.6%ni, 74.8%id,  3.9%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8037668k total,  1438732k used,  6598936k free, 6844k buffers
Swap: 16787768k total,0k used, 16787768k free,   273928k cached

No append mem= required.

A full dmesg is attached so you can analyze the e820/MTRR mapping.

File: dmesg-e820v3patch.txt.bz2

Justin.




Re: [PATCH] x86_32: trim memory by updating e820 v2

2008-01-28 Thread Justin Piszcz



On Mon, 28 Jan 2008, Ingo Molnar wrote:



* Justin Piszcz <[EMAIL PROTECTED]> wrote:


Tried it, it worked successfully!

With the stock kernel, I previously had to boot with mem=8832M, and top
showed this:

top - 18:53:52 up 1 min,  2 users,  load average: 1.03, 0.30, 0.10
Tasks: 169 total,   1 running, 168 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.1%us,  2.6%sy,  4.5%ni, 81.3%id,  5.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8039464k total,  1288948k used,  6750516k free, 3640k buffers
Swap: 16787768k total,0k used, 16787768k free,   178528k cached

With the kernel you mentioned plus the e820 v3 patch:

top - 18:48:13 up 3 min,  6 users,  load average: 1.67, 0.68, 0.25
Tasks: 195 total,   2 running, 193 sleeping,   0 stopped,   0 zombie
Cpu(s): 18.5%us,  1.2%sy,  1.6%ni, 74.8%id,  3.9%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8037668k total,  1438732k used,  6598936k free, 6844k buffers
Swap: 16787768k total,0k used, 16787768k free,   273928k cached

No append mem= required.

A full dmesg is attached so you can analyze the e820/MTRR mapping.


thanks for testing it! The code indeed successfully trimmed your memory
map by 64MB:

from:

[0.00]  BIOS-e820: 0001 - 00022c00 (usable)

to:

[0.00]   modified: 0001 - 00022800 (usable)
[0.00]   modified: 00022800 - 00022c00 (reserved)

what happened on your box previously when you booted without any
trimming - did it sometimes slow down or something like that?
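
The 64MB figure can be cross-checked against the mem=8832M workaround
mentioned earlier in the thread. A small sketch, assuming the archive has
dropped runs of zeros from the addresses above, so that the usable region
originally ended at 0x22c000000 and was trimmed to 0x228000000; both
constants are that assumption, not values copied from the log:

```python
MIB = 1024 * 1024

# Assumed full addresses (the archived log lines appear to have lost digits).
BIOS_END = 0x2_2C00_0000     # end of the usable region reported by the BIOS
TRIMMED_END = 0x2_2800_0000  # end of the usable region after trimming


def trimmed_mib(old_end, new_end):
    """Size in MiB of the range that moved from usable to reserved."""
    return (old_end - new_end) // MIB


def mem_option_for(end):
    """The mem= boot option that would cap RAM at the same address."""
    return f"mem={end // MIB}M"


print(trimmed_mib(BIOS_END, TRIMMED_END))  # 64
print(mem_option_for(TRIMMED_END))         # mem=8832M
```

Under that assumption, the trimmed map ends at the same boundary the manual
mem=8832M option had been enforcing.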

Ingo



When I boot the box without any trimming, it acts like a 286 or 386 and 
takes about 10 minutes to boot (using Raptor disks).


Justin.


Reading Bad DVD Under 2.6.10 freezes the box.

2005-02-07 Thread Justin Piszcz
I have a DVD with three files on it (1.7 GB, 1.7 GB, 900 MB).
On W2K, when I try to copy the second file, I get a BadCRC error message.
Under Linux, I copy up to about 860MB (watched via pipebench) and then the 
machine freezes: I cannot ping it, reach it, or do anything on the 
console, and I am forced to hard reboot.

Main Question >> Why does Linux 'freeze up' when W2K gives a BadCRC error 
msg (never freezes)?

The DVD filesystem is Joliet+ISO (hence none of the files are bigger than 
2GB). Is this behavior normal?  Or is there no error-checking code that 
aborts the read when a DVD has errors, so that the box does not freeze?

Thanks.


Re: Re: Reading Bad DVD Under 2.6.10 freezes the box.

2005-02-07 Thread Justin Piszcz
Yeah, I can try 2.4.29 later tonight. Also, the DVD is not scratched, just 
formatted with Joliet/ISO instead of UDF (which is what should be used on 
DVDs).

With dd if=/dev/hdh of=file.img, even with bs=1 to read 1 byte at a time, 
there seems to be no way to get the data off: the last time I tried it, 
dd just failed. When I use cp to try to copy the file, it freezes the 
machine. This is all under 2.6.10 with a Toshiba 16X DVD-ROM (I can get 
the model number later).
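
For reference, the usual way to salvage a disc image around unreadable
sectors is to keep going after a read error and zero-fill the bad block,
similar in spirit to dd with conv=noerror,sync. A rough sketch of that
idea; the function name and its parameters are invented for illustration,
and it would not help in the case where the kernel itself hangs on the
read:

```python
def copy_skipping_errors(src, dst, total_size, block_size=2048):
    """Copy total_size bytes from src to dst, zero-filling unreadable blocks.

    src must support seek() and read(); dst must support write().
    Returns the list of offsets whose blocks could not be read.
    """
    bad_offsets = []
    offset = 0
    while offset < total_size:
        want = min(block_size, total_size - offset)
        try:
            src.seek(offset)
            chunk = src.read(want)
        except OSError:
            # Treat the whole block as lost and remember where it was.
            chunk = b""
            bad_offsets.append(offset)
        # Pad short or failed reads so later data keeps its position.
        dst.write(chunk.ljust(want, b"\0"))
        offset += want
    return bad_offsets
```

The default block size of 2048 bytes matches the DVD sector size, so each
failed read costs at most one sector of data in the image.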

On Mon, 7 Feb 2005, Xavier Bestel wrote:
On Monday, 7 February 2005 at 08:05 -0500, linux-os wrote:
Main Question >> Why does Linux 'freeze up' when W2K gives a BadCRC error msg
(never freezes)?
Of course it should not. However, there were many incomplete changes
made in 2.6.nn and some may involve problems with locking, etc.
I don't remember a version of the kernel gracefully handling scratched
CD/DVD.
Xav


Question regarding e1000 driver and dropped packets (2.6.5 / 2.6.10)?

2005-02-08 Thread Justin Piszcz
I have two identical machines [mobo/hardware wise]:
Each machine is a Dell GX1p (500MHZ).
I have two Intel Gigabit NICs, one in each box, hooked up to a GigE 
switch.

Ethernet controller: Intel Corp. 82541GI/PI Gigabit Ethernet Controller
Ethernet controller: Intel Corp. 82541GI/PI Gigabit Ethernet Controller
I doubt it's the kernel version; does anyone have any suggestions/ideas why 
one machine has virtually NO overruns/errors/drops and the other has tons?

Also (I doubt this to be the case, but I'll ask anyway): could the way the 
NICs are placed in the box next to other cards alter their PCI/IRQ 
routing and thereby affect error/drop rates?

IE:
PCI1 - promise card / pata
PCI2 - promise card / pata
PCI3 - promise card / sata
PCI4 - e1000 nic
PCI5 - 4 port nic
Would it make sense to order them in a different direction?
Also, is there a correlation between errors on the NIC and ERR
in /proc/interrupts?
Secondly, could loading lm-sensors/temperature modules be causing these
problems?
dmesg from box2 below:
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
eth1: Setting full-duplex based on MII#1 link partner capability of 45e1.
eth2: Setting full-duplex based on MII#1 link partner capability of 45e1.
nfs warning: mount version older than kernel
nfs warning: mount version older than kernel
nfs warning: mount version older than kernel
nfs warning: mount version older than kernel
i2c /dev entries driver
piix4_smbus :00:07.3: Found :00:07.3 device
piix4_smbus :00:07.3: WARNING: SMBus interface has been FORCEFULLY ENABLED!
mtrr: no MTRR for fd00,80 found
spurious 8259A interrupt: IRQ7.
spurious 8259A interrupt: IRQ15.

I am currently out of ideas; if anyone can suggest anything, I'd be most 
grateful, thanks!

On the first box, there are hardly any problems receiving packets:
Note the errors & dropped on the receiving end:
BOX1: (2.6.5)
eth0  Link encap:Ethernet  HWaddr 00:0E:0C:00:CD:B1
  inet addr:10.0.2.254  Bcast:10.0.2.255  Mask:255.255.255.0
  inet6 addr: fe80::20e:cff:fe00:cdb1/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:196787934 errors:4 dropped:0 overruns:0 frame:2
  TX packets:101356779 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:2602045376 (2481.5 Mb)  TX bytes:4051930608 (3864.2 Mb)
  Base address:0xcc80 Memory:ff02-ff04
BOX1 MODULES:
$ lsmod
Module  Size  Used by
ip_nat_ftp  4016  0
ip_conntrack_ftp   71088  1 ip_nat_ftp
BOX2: (2.6.10)
On another box (same physical HW) I get this:
eth0  Link encap:Ethernet  HWaddr 00:0E:0C:00:D2:06
  inet addr:10.0.2.253  Bcast:10.0.2.255  Mask:255.255.255.0
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
-->   RX packets:446380046 errors:1276833 dropped:1276833 overruns:1276833 frame:0
  TX packets:572550636 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:2351750726 (2.1 GiB)  TX bytes:3659840330 (3.4 GiB)
  Base address:0xd8c0 Memory:f8fa-f8fc
BOX2 MODULES:
$ lsmod
Module  Size  Used by
ip_nat_irc  3408  0
ip_conntrack_irc   70480  1 ip_nat_irc
ip_nat_ftp  4112  0
ip_conntrack_ftp   71344  1 ip_nat_ftp
adm1021            11060  0
i2c_piix4   6000  0
i2c_sensor  2784  1 adm1021
i2c_dev 7680  0
i2c_core   18224  4 adm1021,i2c_piix4,i2c_sensor,i2c_dev
I have tried using different cables and ports on the switch; the result is 
the same.

$ tar cvf /box2/4gb_of_stuff.tar 4gb_of_stuff  # then the numbers rise rapidly
After copying only 1-2GB on BOX2, this is what I get:
eth0  Link encap:Ethernet  HWaddr 00:0E:0C:00:D2:06
  inet addr:10.0.2.253  Bcast:10.0.2.255  Mask:255.255.255.0
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:1038733 errors:1459 dropped:1459 overruns:1459 frame:0
  TX packets:560952 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:1491121900 (1.3 GiB)  TX bytes:763420385 (728.0 MiB)
  Base address:0xd8c0 Memory:f8fa-f8fc
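
To compare the two boxes on equal terms, the RX counters can be pulled out
of the ifconfig output and turned into an error rate. A small sketch,
assuming the old net-tools ifconfig layout shown in this message; the
function names are illustrative only:

```python
import re

# Matches the "RX packets:..." line of old-style ifconfig output, even when
# the mail client wrapped "frame:N" onto the next line.
RX_LINE = re.compile(
    r"RX packets:(?P<packets>\d+)\s+errors:(?P<errors>\d+)\s+"
    r"dropped:(?P<dropped>\d+)\s+overruns:(?P<overruns>\d+)\s+"
    r"frame:(?P<frame>\d+)"
)


def parse_rx(text):
    """Return the RX counters from ifconfig output as a dict of ints."""
    match = RX_LINE.search(text)
    if match is None:
        raise ValueError("no RX counter line found")
    return {name: int(value) for name, value in match.groupdict().items()}


def errors_per_million(counters):
    """RX errors per million received packets."""
    return counters["errors"] * 1_000_000 / counters["packets"]
```

Feeding it the lifetime counters quoted in this message gives well under one
error per million packets on BOX1 and a few thousand per million on BOX2.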
The only difference is that one box has more HDDs and an extra PCI 
controller or so:

BOX1 LSPCI:
00:00.0 Host bridge: Intel Corp. 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 03)
00:01.0 PCI bridge: Intel Corp. 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03)
00:07.0 ISA bridge: Intel Corp. 82371AB/EB/MB PIIX4 ISA (rev 02)
00:07.1 IDE interface: Intel Corp. 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.2 USB Controller: Intel Corp. 82371AB/EB/MB PIIX4 USB (rev 01)
00:07.3 Bridge: Intel Corp. 82371AB/EB/MB PIIX4 ACPI (rev 02)
00:0d.0 Ethernet controller: Intel Corp. 82541GI/PI Gigabit Ethernet Controller
00:0e.0 Unknown mass storage controller: Promise Technology, Inc. 20268 (rev 02)
00:0f.0 PCI

Re: Question regarding e1000 driver and dropped packets (2.6.5 / 2.6.10)?

2005-02-09 Thread Justin Piszcz
As far as the temp stuff, it does support i2c over the smbus.
$ sensors
max1617-i2c-0-1a
Adapter: SMBus PIIX4 adapter at 0850
Board:   +48C  (low  =   -55C, high =  +127C)
CPU: +49C  (low  =   -55C, high =  +110C)

Whether it's recommended or not, I'm not sure; I'll try without it later today.
On Tue, 8 Feb 2005, Bukie Mabayoje wrote:

Bukie Mabayoje wrote:
Can you do a simple test?
Connect the two boxes to the same switch. (No other box should be on the 
physical bus.)
1. Send packets from BoxA ---> BoxB (record the stats)
2. Send packets from BoxB ---> BoxA (record the stats)
3. Send packets simultaneously from BoxB -> BoxA and BoxA -> BoxB (record the stats)
If you can find a third box:
4. Send packets [BoxA and BoxC] -> BoxB and BoxB -> BoxA (record the stats)
5. Send packets [BoxB and BoxC] -> BoxA and BoxA -> BoxB (record the stats)
I don't understand why you received more packets on BoxB. A controlled test 
will help clarify any ambiguity.
[BoxA] RX packets:196787934 errors:4 dropped:0 overruns:0 frame:2
       TX packets:101356779 errors:0 dropped:0 overruns:0 carrier:0
[BoxB] RX packets:446380046 errors:1276833 dropped:1276833 overruns:1276833 frame:0
       TX packets:572550636 errors:0 dropped:0 overruns:0 carrier:0
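
For the "record the stats" part of each step, it is easier to compare
per-test deltas than lifetime totals. A minimal sketch, assuming the kernel
exposes per-interface counters under /sys/class/net/<iface>/statistics and
that rx_fifo_errors is the counter ifconfig reports as overruns; the helper
names and the choice of counters are illustrative:

```python
from pathlib import Path

# Counters to record around each test run (assumed to exist in sysfs).
COUNTERS = (
    "rx_packets", "rx_errors", "rx_dropped", "rx_fifo_errors",
    "tx_packets", "tx_errors",
)


def snapshot(iface, root="/sys/class/net"):
    """Read the current value of each counter for one interface."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def delta(before, after):
    """Counter increase between two snapshots of the same interface."""
    return {name: after[name] - before[name] for name in before}


# Hypothetical use around one test step:
#   before = snapshot("eth0")
#   ... send packets from BoxA to BoxB ...
#   print(delta(before, snapshot("eth0")))
```

Taking a snapshot on both boxes before and after each numbered step keeps
the results of the five runs from bleeding into one another.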
Justin Piszcz wrote:
I have two identical machines [mobo/hardware wise]:
Each machine is a Dell GX1p (500MHZ).
I have two Intel Gigabit NICs, one in each box, hooked up to a GigE
switch.
Ethernet controller: Intel Corp. 82541GI/PI Gigabit Ethernet Controller
Ethernet controller: Intel Corp. 82541GI/PI Gigabit Ethernet Controller
I doubt it's the kernel version; does anyone have any suggestions/ideas why
one machine has virtually NO overruns/errors/drops and the other has tons?
Also (I doubt this to be the case, but I'll ask anyway): could the way the
NICs are placed in the box next to other cards alter their PCI/IRQ
routing and thereby affect error/drop rates?
IE:
PCI1 - promise card / pata
PCI2 - promise card / pata
PCI3 - promise card / sata
PCI4 - e1000 nic
PCI5 - 4 port nic
What matters is which INT# line [A, B, C, D], or combination of lines, each 
of PCI slots 1, 2, 3 and 4 is using.
You can find out by running lspci -vv.
Also check whether they are routed to the same system interrupt and, 
lastly, whether there are interrupt priority issues.
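
One quick way to see whether several cards ended up on the same system
interrupt is to look for lines in /proc/interrupts that list more than one
device. A rough sketch, assuming the 2.6-era layout (IRQ number, per-CPU
counts, a one-word controller name, then a comma-separated device list);
the function name and the sample text are made up for illustration:

```python
def shared_irqs(text):
    """Map IRQ number -> device names for IRQs used by several devices."""
    shared = {}
    for line in text.splitlines():
        irq, sep, rest = line.partition(":")
        if not sep or not irq.strip().isdigit():
            continue  # CPU header line or NMI/LOC/ERR summary rows
        fields = rest.split()
        # Skip the per-CPU counters; the next field is the controller name.
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1
        names = " ".join(fields[i + 1:]).split(",")
        devices = [name.strip() for name in names if name.strip()]
        if len(devices) > 1:
            shared[int(irq)] = devices
    return shared


# Hypothetical sample, not taken from either box:
SAMPLE = """\
           CPU0
  6:         38    IO-APIC-edge  floppy
 11:     123456          XT-PIC  eth0, ide2, ide3
ERR:          0
"""
print(shared_irqs(SAMPLE))  # {11: ['eth0', 'ide2', 'ide3']}
```

On a real box the input would be the contents of /proc/interrupts.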

Would it make sense to order them in a different direction?
May not help in identifying the problem.

Also, is there a correlation between errors on the NIC and ERR
in /proc/interrupts?
Maybe..

Secondly, could loading lm-sensors/temperature modules be causing these
problems?
You don't have any overrun on this box.
My error. It may be related. Try without loading the lm-sensors/temperature modules.
I don't think your motherboard supports the i2c stuff you are loading.
You have the Intel 440BX AGP chipset and there is no i2c interface on it.


dmesg from box2 below:
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
eth1: Setting full-duplex based on MII#1 link partner capability of 45e1.
eth2: Setting full-duplex based on MII#1 link partner capability of 45e1.
nfs warning: mount version older than kernel
nfs warning: mount version older than kernel
nfs warning: mount version older than kernel
nfs warning: mount version older than kernel
i2c /dev entries driver
piix4_smbus :00:07.3: Found :00:07.3 device
piix4_smbus :00:07.3: WARNING: SMBus interface has been FORCEFULLY ENABLED!
mtrr: no MTRR for fd00,80 found
spurious 8259A interrupt: IRQ7.
spurious 8259A interrupt: IRQ15.
I am currently out of ideas; if anyone can suggest anything, I'd be most
grateful, thanks!
On the first box, there are hardly any problems receiving packets:
Note the errors & dropped on the receiving end:
BOX1: (2.6.5)
eth0  Link encap:Ethernet  HWaddr 00:0E:0C:00:CD:B1
   inet addr:10.0.2.254  Bcast:10.0.2.255  Mask:255.255.255.0
   inet6 addr: fe80::20e:cff:fe00:cdb1/64 Scope:Link
   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
   RX packets:196787934 errors:4 dropped:0 overruns:0 frame:2
   TX packets:101356779 errors:0 dropped:0 overruns:0 carrier:0
   collisions:0 txqueuelen:1000
   RX bytes:2602045376 (2481.5 Mb)  TX bytes:4051930608 (3864.2 Mb)
   Base address:0xcc80 Memory:ff02-ff04
BOX1 MODULES:
$ lsmod
Module  Size  Used by
ip_nat_ftp  4016  0
ip_conntrack_ftp   71088  1 ip_nat_ftp
BOX2: (2.6.10)
On another box (same physical HW) I get this:
eth0  Link encap:Ethernet  HWaddr 00:0E:0C:00:D2:06
   inet addr:10.0.2.253  Bcast:10.0.2.255  Mask:255.255.255.0
   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
-->   RX packets:446380046 errors:1276833 dropped:1276833 overruns:1276833 frame:0
   TX packets:572550636 errors:0 dropped:0 overruns:0 carrier:0
   collisi

Intel Gigabit NIC (2.6.5 -> 2.6.10) Bug(?) Found

2005-02-20 Thread Justin Piszcz
What is this e-mail about?
Something in the kernel changed regarding the Intel e1000 driver between 
2.6.5 and 2.6.10. The change resulted in thousands of errors when the NIC 
is receiving data. For the past two weeks I have thought about this and 
tried everything I could think of; it had really been pestering me. 
Normally I never looked at the ifconfig output for eth0, eth1, etc., 
because I checked it a long time ago with earlier kernels and it was 
fine.  I guess I should check my NIC statistics more often. I have tried 
the following to figure out why I get so many dropped packets and errors 
on an interface:

1] New Intel [same model] NIC.
2] Different ports in the switch.
3] New cable.
4] Switched PCI slots for the Intel Gigabit Card.
5] Switched BIOS settings/parameters to exact settings as other, identical
   machine.
None of these fixed the problem. There are two identical machines on the 
network here, both with the same Intel Gigabit NIC (82541GI/PI); on one 
of them there are very few (1-3) errors on the NIC, if any.  The test 
that reproduces the problem the quickest is

   dd if=/dev/zero of=/nfsv3/udp/file.img

run on another box, writing to the box that gets the RX errors on the 
NIC.  Generally, there would be about 100 errors every 10 seconds.  Since 
one machine is running 2.6.5 and the other 2.6.10, I figured it had to be 
something in the kernel that was causing this.  Therefore I grabbed 
ethtool, installed it, and did a basic query of the network setting 
parameters. I immediately noticed a difference, which is shown below:

* Box with no problems.
# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate:  on
RX: on
TX: on
* Box with NIC that generates errors, dropped packets and overrun errors.
# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate:  on
RX: off
TX: off
According to the manpage:
   -A     change the pause parameters of the specified ethernet device.

   rx on|off
          Specify if RX pause is enabled.
   tx on|off
          Specify if TX pause is enabled.
# ethtool -A eth0 rx on
# ethtool -A eth0 tx on
My machine now:
# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate:  on
RX: on
TX: on
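
The pause parameters control Ethernet flow control (IEEE 802.3x pause
frames). The check-and-fix steps above could be scripted roughly as
follows; this is only a sketch, assuming ethtool -a prints the "RX: on/off"
layout shown in this message, and the function names are invented for
illustration:

```python
def parse_pause(output):
    """Parse `ethtool -a` output into a dict such as {'RX': True, ...}."""
    settings = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Only lines of the form "<name>: on" or "<name>: off" are settings.
        if sep and value in ("on", "off"):
            settings[key.strip()] = value == "on"
    return settings


def pause_fix_command(iface, settings):
    """Return the ethtool -A argument list that enables RX/TX pause.

    Returns None when both directions are already on.
    """
    args = []
    for key, flag in (("RX", "rx"), ("TX", "tx")):
        if not settings.get(key, False):
            args += [flag, "on"]
    return ["ethtool", "-A", iface] + args if args else None
```

The returned list can be passed to subprocess.run as root, so the setting
is changed only on boxes where it is actually off.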
Then I re-ran the dd command mentioned earlier and let it run for about 
ten minutes. Lo and behold, not a single dropped packet, overrun or frame
error reported!

  RX packets:6157606 errors:0 dropped:0 overruns:0 frame:0
Previously, this is what I would get after only a minute of running that 
dd command (I also get the errors copying files etc, dd command just 
speeds things up):

  RX packets:6374096 errors:1419 dropped:1419 overruns:1419 frame:0
Afterwards, I no longer have any errors.
To the Intel/Kernel guys:
Question: these are identical machines for the most part, and even the 
same NICs are used in each box, so why are the settings different in 2.6.5 
than in 2.6.10?  I do not believe that it is a distribution-specific 
issue, as I did not even have ethtool installed before I checked this, nor 
do I see it in any boot scripts.  For now, I will just set the proper 
settings myself (-A tx on and -A rx on), but is there another way to do 
this, or did it change in the kernel at some point?

Further investigation reveals that on my main machine (2.6GHz w/HT), which 
has an onboard Intel PRO/1000 NIC on the CSA bus (A-Bit IC7-G), the pause 
feature is also off; HOWEVER, this machine does not exhibit any errors!

  RX packets:2471666 errors:0 dropped:0 overruns:0 frame:0
  TX packets:56413066 errors:0 dropped:0 overruns:0 carrier:0
Is it a bug that it defaults to off in the newer kernel versions, given 
that it causes MASSIVE errors on the RX side?  Or should people who run 
gigabit interfaces on slower machines just add the ethtool commands to 
their startup scripts to avoid the errors?

There may be some correlation between CPU speed and whether a machine can 
handle the traffic with the pause option on or off.  If anyone has any 
idea of what the pause option is about and why it changed from 2.6.5 to 
2.6.10, I'd like to know!

Thanks!

