kernel panic on resume from S3 - stumped

2012-12-29 Thread Tim Hockin
4 days ago I had Ubuntu Lucid running on this computer. Suspend and
resume worked flawlessly every time.

Then I upgraded to Ubuntu Precise. Suspend seems to work, but resume
fails every time. The video never initializes.  By the flashing
keyboard lights, I guess it's a kernel panic. It fails from the Live
CD and from a fresh install.

Here is my debugging so far.

Install all updates (3.2 kernel, nouveau driver)
Reboot
Try suspend = fails

Install Ubuntu's linux-generic-lts-quantal (3.5 kernel, nouveau driver)
Reboot
Try suspend = fails

Install nVidia's 304 driver
Reboot
Try suspend = fails

From within X:
  echo core > /sys/power/pm_test
  echo mem > /sys/power/state
The system acts like it is going to sleep, and then wakes up a few
seconds later. dmesg shows:

[ 1230.083404] ------------[ cut here ]------------
[ 1230.083410] WARNING: at
/build/buildd/linux-lts-quantal-3.5.0/kernel/power/suspend_test.c:53
suspend_test_finish+0x86/0x90()
[ 1230.083411] Hardware name: To Be Filled By O.E.M.
[ 1230.083412] Component: resume devices, time: 14424
[ 1230.083412] Modules linked in: snd_emu10k1_synth snd_emux_synth
snd_seq_virmidi snd_seq_midi_emul bnep rfcomm parport_pc ppdev
nvidia(PO) snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm snd_page_alloc
snd_util_mem snd_hwdep snd_seq_midi snd_rawmidi snd_seq_midi_event
snd_seq snd_timer coretemp snd_seq_device kvm_intel kvm snd
ghash_clmulni_intel soundcore aesni_intel btusb cryptd aes_x86_64
bluetooth i7core_edac edac_core microcode mac_hid lpc_ich mxm_wmi
shpchp serio_raw wmi hid_generic lp parport usbhid hid r8169
pata_marvell
[ 1230.083445] Pid: 3329, comm: bash Tainted: P O 3.5.0-21-generic
#32~precise1-Ubuntu
[ 1230.083446] Call Trace:
[ 1230.083448] [] warn_slowpath_common+0x7f/0xc0
[ 1230.083452] [] warn_slowpath_fmt+0x46/0x50
[ 1230.083455] [] suspend_test_finish+0x86/0x90
[ 1230.083457] [] suspend_devices_and_enter+0x10b/0x200
[ 1230.083460] [] enter_state+0xd1/0x100
[ 1230.083463] [] pm_suspend+0x1b/0x60
[ 1230.083465] [] state_store+0x45/0x70
[ 1230.083467] [] kobj_attr_store+0xf/0x30
[ 1230.083471] [] sysfs_write_file+0xef/0x170
[ 1230.083476] [] vfs_write+0xb3/0x180
[ 1230.083480] [] sys_write+0x4a/0x90
[ 1230.083483] [] system_call_fastpath+0x16/0x1b
[ 1230.083488] ---[ end trace 839cdd0078b3ce03 ]---

Boot with init=/bin/bash
unload all modules except USBHID
echo core > /sys/power/pm_test
echo mem > /sys/power/state
system acts like it is going to sleep, and then wakes up a few seconds later
echo none > /sys/power/pm_test
echo mem > /sys/power/state
system goes to sleep
press power to resume = fails
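For anyone reproducing this, the manual steps above can be condensed into a script that walks every pm_test level from least to most invasive, which narrows down the suspend phase that trips the warning. This is only a sketch of the standard /sys/power interface (per Documentation/power/basic-pm-debugging); the ROOT parameter is an illustrative addition so it can be dry-run against a scratch directory instead of real hardware:

```shell
# walk_pm_test_levels ROOT -- ROOT is /sys/power on real hardware;
# parameterized here only so the sketch can be dry-run safely.
walk_pm_test_levels() {
    root="$1"
    # Levels ordered least to most invasive.
    for level in freezer devices platform processors core; do
        echo "trying pm_test level: $level" >&2
        echo "$level" > "$root/pm_test"
        # With pm_test set, writing mem performs a partial suspend
        # that auto-resumes after a few seconds instead of really
        # entering S3.
        echo mem > "$root/state"
        # On real hardware: check dmesg for new WARNINGs here.
    done
    # Restore normal suspend behaviour.
    echo none > "$root/pm_test"
}
# Real use (as root): walk_pm_test_levels /sys/power
```

The first level that produces a warning points at the subsystem to blame.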

At this point I am stumped on how to debug. This is a "modern"
computer with no serial ports. It worked under Lucid, so I know it is
POSSIBLE.

Mobo: ASRock X58 single-socket
CPU: Westmere 6 core (12 hyperthreads) 3.2 GHz
RAM: 12 GB ECC
Disk: sda = Intel SSD, mounted on /
Disk: sdb = Intel SSD, not mounted
Disk: sdc = Seagate HDD, not mounted
Disk: sdd = Seagate HDD, not mounted
NIC = Onboard RTL8168e/8111e
Sound = EMU1212 (emu10k1, not even configured yet)
Video = nVidia GeForce 7600 GT
KB = PS2 (also tried USB)
Mouse = USB

I have not updated to a more current kernel than 3.5, but I will if
there's evidence that this is resolved.  Any other clever trick to
try?

Tim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kernel panic on resume from S3 - stumped

2012-12-29 Thread Tim Hockin
Running a suspend with pm_trace set, I get:

aer 0000:00:03.0:pcie02: hash matches

I don't know what magic might be needed here, though.
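For context, the "hash matches" line comes from the pm_trace mechanism (Documentation/power/s2ram): with pm_trace enabled, the PM core stores a hash of each device name in the RTC CMOS as it suspends/resumes it, so the hash survives a hard reset and dmesg on the next boot can name the last device touched. A sketch of that procedure follows; the ROOT parameter is an illustrative addition so it can be exercised against a scratch directory:

```shell
# pm_trace_suspend ROOT -- ROOT is /sys/power on real hardware.
pm_trace_suspend() {
    root="$1"
    # pm_trace hashes the device currently being suspended/resumed
    # into the RTC CMOS. Caveat: this deliberately corrupts the
    # system clock, which must be reset afterwards.
    echo 1 > "$root/pm_trace"
    echo mem > "$root/state"
    # If resume hangs: power-cycle, boot, then find the culprit with:
    #   dmesg | grep 'hash matches'
}
```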

I guess next step is to try to build a non-distro kernel.

On Sat, Dec 29, 2012 at 1:57 PM, Rafael J. Wysocki  wrote:
> On Saturday, December 29, 2012 12:03:13 PM Tim Hockin wrote:
>> 4 days ago I had Ubuntu Lucid running on this computer. Suspend and
>> resume worked flawlessly every time.
>>
>> Then I upgraded to Ubuntu Precise.
>
> Well, do you use a distro kernel or a kernel.org kernel?
>
>> I have not updated to a more current kernel than 3.5, but I will if
>> there's evidence that this is resolved.  Any other clever trick to
>> try?
>
> There is no evidence and there won't be if you don't try a newer kernel.
>
> Thanks,
> Rafael
>
>
> --
> I speak only for myself.
> Rafael J. Wysocki, Intel Open Source Technology Center.


Re: kernel panic on resume from S3 - stumped

2012-12-29 Thread Tim Hockin
Quick update: booting with 'noapic' on the command line seems to make
it resume successfully.

The main dmesg diffs (other than the obvious "Skipping IOAPIC probe"
and IRQ number differences) are:

-nr_irqs_gsi: 40
+nr_irqs_gsi: 16

-NR_IRQS:16640 nr_irqs:776 16
+NR_IRQS:16640 nr_irqs:368 16

-system 00:0a: [mem 0xfec00000-0xfec00fff] could not be reserved
+system 00:0a: [mem 0xfec00000-0xfec00fff] has been reserved

and a new warning about irq 5: nobody cared (try booting with the
"irqpoll" option)

I'll see if I can sort out further differences, but I thought it was
worth sending this new info along, anyway.

It did not require 'noapic' on the Lucid (2.6.32?) kernel.
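A minimal way to produce diffs like the above is to save dmesg after each boot configuration and filter the diff down to interrupt-related lines. This helper is only a sketch (filenames illustrative):

```shell
# compare_boot_logs OLD NEW -- show interrupt-related differences
# between two saved dmesg files, e.g. after running
#   dmesg > dmesg-apic.txt     (normal boot)
#   dmesg > dmesg-noapic.txt   (boot with 'noapic')
compare_boot_logs() {
    # '^[+-][^+-]' keeps changed lines but drops the ---/+++ headers.
    diff -u "$1" "$2" | grep -E '^[+-][^+-].*(irq|IRQ|IOAPIC|reserved)'
}
# e.g.: compare_boot_logs dmesg-apic.txt dmesg-noapic.txt
```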


On Sat, Dec 29, 2012 at 9:34 PM, Tim Hockin  wrote:
> Running a suspend with pm_trace set, I get:
>
> aer 0000:00:03.0:pcie02: hash matches
>
> I don't know what magic might be needed here, though.
>
> I guess next step is to try to build a non-distro kernel.

Re: kernel panic on resume from S3 - stumped

2012-12-29 Thread Tim Hockin
Best guess:

With 'noapic', I see the "irq 5: nobody cared" message on resume,
along with 1 IRQ5 counts in /proc/interrupts (the devices claiming
that IRQ are quiescent).

Without 'noapic' that must be triggering something else to go haywire,
perhaps the AER logic (though that is all MSI, so probably not).  I'm
flying blind on those boots.

I bet that, if I can recall how to re-enable IRQ5, I'll see it
continuously asserting.  Chipset or BIOS bug maybe.  I don't know if I
had AER enabled under Lucid, so that might be the difference.

I'll try a vanilla kernel next, maybe hack on AER a bit, to see if I
can make it progress.
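To check whether IRQ 5 really is screaming across resume, its per-CPU counts in /proc/interrupts can be sampled before and after. This awk sketch totals one IRQ's columns, relying on the standard file format (IRQ number, per-CPU counts, then chip/handler names):

```shell
# irq_count IRQ [FILE] -- sum the per-CPU counts for one IRQ line in
# /proc/interrupts (or a saved copy). Sample before and after resume;
# a rapidly climbing total means the line is storming.
irq_count() {
    awk -v want="$1:" '$1 == want {
        total = 0
        for (i = 2; i <= NF; i++) {
            if ($i + 0 == $i) total += $i   # numeric per-CPU columns
            else break                       # stop at chip/handler names
        }
        print total
    }' "${2:-/proc/interrupts}"
}
# e.g.: watch -n1 'grep "^  5:" /proc/interrupts'
```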


On Sat, Dec 29, 2012 at 10:19 PM, Tim Hockin  wrote:
> Quick update: booting with 'noapic' on the commandline seems to make
> it resume successfully.
>
> The main dmesg diffs, other than the obvious "Skipping IOAPIC probe"
> and IRG number diffs) are:
>
> -nr_irqs_gsi: 40
> +nr_irqs_gsi: 16
>
> -NR_IRQS:16640 nr_irqs:776 16
> +NR_IRQS:16640 nr_irqs:368 16
>
> -system 00:0a: [mem 0xfec0-0xfec00fff] could not be reserved
> +system 00:0a: [mem 0xfec0-0xfec00fff] has been reserved
>
> and a new warning about irq 5: nobody cared (try booting with the
> "irqpoll" option)
>
> I'll see if I can sort out further differences, but I thought it was
> worth sending this new info along, anyway.
>
> It did not require 'noapic' on the Lucid (2.6.32?) kernel
>
>
> On Sat, Dec 29, 2012 at 9:34 PM, Tim Hockin  wrote:
>> Running a suspend with pm_trace set, I get:
>>
>> aer :00:03.0:pcie02: hash matches
>>
>> I don't know what magic might be needed here, though.
>>
>> I guess next step is to try to build a non-distro kernel.
>>
>> On Sat, Dec 29, 2012 at 1:57 PM, Rafael J. Wysocki  wrote:
>>> On Saturday, December 29, 2012 12:03:13 PM Tim Hockin wrote:
>>>> 4 days ago I had Ubuntu Lucid running on this computer. Suspend and
>>>> resume worked flawlessly every time.
>>>>
>>>> Then I upgraded to Ubuntu Precise.
>>>
>>> Well, do you use a distro kernel or a kernel.org kernel?
>>>
>>>> Suspend seems to work, but resume
>>>> fails every time. The video never initializes.  By the flashing
>>>> keyboard lights, I guess it's a kernel panic. It fails from the Live
>>>> CD and from a fresh install.
>>>>
>>>> Here is my debug so far.
>>>>
>>>> Install all updates (3.2 kernel, nouveau driver)
>>>> Reboot
>>>> Try suspend = fails
>>>>
>>>> Install Ubuntu's linux-generic-lts-quantal (3.5 kernel, nouveau driver)
>>>> Reboot
>>>> Try suspend = fails
>>>>
>>>> Install nVidia's 304 driver
>>>> Reboot
>>>> Try suspend = fails
>>>>
>>>> From within X:
>>>>   echo core > /sys/power/pm_test
>>>>   echo mem > /sys/power/state
>>>> The system acts like it is going to sleep, and then wakes up a few
>>>> seconds later. dmesg shows:
>>>>
>>>> [ 1230.083404] [ cut here ]
>>>> [ 1230.083410] WARNING: at
>>>> /build/buildd/linux-lts-quantal-3.5.0/kernel/power/suspend_test.c:53
>>>> suspend_test_finish+0x86/0x90()
>>>> [ 1230.083411] Hardware name: To Be Filled By O.E.M.
>>>> [ 1230.083412] Component: resume devices, time: 14424
>>>> [ 1230.083412] Modules linked in: snd_emu10k1_synth snd_emux_synth
>>>> snd_seq_virmidi snd_seq_midi_emul bnep rfcomm parport_pc ppdev
>>>> nvidia(PO) snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm snd_page_alloc
>>>> snd_util_mem snd_hwdep snd_seq_midi snd_rawmidi snd_seq_midi_event
>>>> snd_seq snd_timer coretemp snd_seq_device kvm_intel kvm snd
>>>> ghash_clmulni_intel soundcore aesni_intel btusb cryptd aes_x86_64
>>>> bluetooth i7core_edac edac_core microcode mac_hid lpc_ich mxm_wmi
>>>> shpchp serio_raw wmi hid_generic lp parport usbhid hid r8169
>>>> pata_marvell
>>>> [ 1230.083445] Pid: 3329, comm: bash Tainted: P O 3.5.0-21-generic
>>>> #32~precise1-Ubuntu
>>>> [ 1230.083446] Call Trace:
>>>> [ 1230.083448] [] warn_slowpath_common+0x7f/0xc0
>>>> [ 1230.083452] [] warn_slowpath_fmt+0x46/0x50
>>>> [ 1230.083455] [] suspend_test_finish+0x86/0x90
>>>> [ 1230.083457] [] suspend_devices_and_enter+0x10b/0x200
>>>> [ 1230.083460] [] enter_state+0xd1/0x100
>>>> [ 12

Re: kernel panic on resume from S3 - stumped

2012-12-30 Thread Tim Hockin
On Sun, Dec 30, 2012 at 2:55 PM, Rafael J. Wysocki  wrote:
> On Saturday, December 29, 2012 11:17:11 PM Tim Hockin wrote:
>>
>> I'll try a vanilla kernel next, maybe hack on AER a bit, to see if I
>> can make it progress.
>
> I wonder what happens if you simply disable AER for starters?
>
> There is the pci=noaer kernel command line switch for that.

That still panics on resume.  Damn.  I really think it is down to that
interrupt storm at resume.  Something somewhere is getting stuck
asserting, and we don't know how to EOI it.  PIC vs APIC is just
changing the operating mode.

Now the question is whether I am going to track through Intel errata
(more than I have already) and through chipset docs to figure out what
it could be, or just leave it at noapic.

I've already got one new PCI quirk to code up.
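On the earlier question of whether AER was enabled under Lucid: the AER extended capability shows up in `lspci -vv` output as "Advanced Error Reporting", so comparing saved dumps from both installs would settle it. A trivial sketch over a saved dump (the dump path is illustrative):

```shell
# has_aer DUMPFILE -- check a saved 'lspci -vv' dump (captured as root
# with: lspci -vv > lspci-dump.txt) for the AER extended capability.
has_aer() {
    grep -q 'Advanced Error Reporting' "$1"
}
# e.g.: has_aer lspci-lucid.txt && echo "AER capability present"
```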

> Thanks,
> Rafael
>

Re: cgroup: status-quo and userland efforts

2013-04-22 Thread Tim Hockin
Hi Tejun,

This email worries me.  A lot.  It sounds very much like retrograde
motion from our (Google's) point of view.

We absolutely depend on the ability to split cgroup hierarchies.  It
pretty much saved our fleet from imploding, in a way that a unified
hierarchy just could not do.  A mandated unified hierarchy is
madness.  Please step away from the ledge.

More, going towards a unified hierarchy really limits what we can
delegate, and that is the word of the day.  We've got a central
authority agent running which manages cgroups, and we want out of this
business.  At least, we want to be able to grant users a set of
constraints, and then let them run wild within those constraints.
Forcing all such work to go through a daemon has proven to be very
problematic, and it has been great now that users can have DIY
sub-cgroups.
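The delegation model described above boils down, in cgroup-v1 terms, to the admin creating a per-user subtree and handing over ownership; the user can then mkdir sub-cgroups freely, with no daemon in the loop. This is a rough sketch of the mechanism only — the paths, user names, and choice of controller are illustrative, not Google's actual setup:

```shell
# delegate_subtree CGROUP_ROOT USER -- admin-side sketch: create a
# per-user subtree under one controller hierarchy and chown it.
# CGROUP_ROOT would be e.g. /sys/fs/cgroup/memory on a v1 system.
delegate_subtree() {
    root="$1"; user="$2"
    mkdir -p "$root/$user"
    chown -R "$user" "$root/$user"   # user now owns the knobs inside
}
# User side, inside the delegated subtree (illustrative paths):
#   mkdir /sys/fs/cgroup/memory/alice/worker-1
#   echo $$ > /sys/fs/cgroup/memory/alice/worker-1/tasks
```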

berra...@redhat.com said, downthread:

> We ultimately do need the ability to delegate hierarchy creation to
> unprivileged users / programs, in order to allow containerized OS to
> have the ability to use cgroups. Requiring any applications inside a
> container to talk to a cgroups "authority" existing on the host OS is
> not a satisfactory architecture. We need to allow for a container to
> be self-contained in its usage of cgroups.

This!  A thousand times, this!

> At the same time, we don't need/want to give them unrestricted ability
> to create arbitrarily complex hierarchies - we need some limits on it
> to avoid them exposing pathologically bad kernel behaviour.
>
> This could be as simple as saying that each cgroup controller directory
> has a tunable "cgroups.max_children" and/or "cgroups.max_depth" which
> allow limits to be placed when delegating administration of part of a
> cgroups tree to an unprivileged user.

We've been bitten by this, and more limitations would be great.  We've
got some less-than-perfect patches that impose limits for us now.

> I've no disagreement that we need a unified hierarchy. The workman
> app explicitly does /not/ expose the concept of differing hierarchies
> per controller. Likewise libvirt will not allow the user to configure
> non-unified hierarchies.

Strong disagreement, here.  We use split hierarchies to great effect.
Containment should be composable.  If your users or abstractions can't
handle it, please feel free to co-mount the universe, but please
PLEASE don't force us to.

I'm happy to talk more about what we do and why.

Tim




On Sat, Apr 6, 2013 at 3:21 AM, Tejun Heo  wrote:
> Hello, guys.
>
>  Status-quo
>  ==
>
> It's been about a year since I wrote up a summary on cgroup status quo
> and future plans.  We're not there yet but much closer than we were
> before.  At least the locking and object life-time management aren't
> crazy anymore and most controllers now support proper hierarchy
> although not all of them agree on how to treat inheritance.
>
> IIRC, the yet-to-be-converted ones are blk-throttle and perf.  cpu
> needs to be updated so that it at least supports a similar mechanism
> as cfq-iosched for configuring ratio between tasks on an internal
> cgroup and its children.  Also, we really should update how cpuset
> handles a cgroup becoming empty (no cpus or memory node left due to
> hot-unplug).  It currently transfers all its tasks to the nearest
> ancestor with executing resources, which is an irreversible process
> which would affect all other co-mounted controllers.  We probably want
> it to just take on the masks of the ancestor until its own executing
> resources become online again, and the new behavior should be gated
> behind a switch (Li, can you please look into this?).
>
> While we have still ways to go, I feel relatively confident saying
> that we aren't too far out now, well, except for the writeback mess
> that still needs to be tackled.  Anyways, once the remaining bits are
> settled, we can proceed to implement the unified hierarchy mode I've
> been talking about forever.  I can't think of any fundamental
> roadblocks at the moment but who knows?  The devil usually is in the
> details.  Let's hope it goes okay.
>
> So, while we aren't moving as fast as we wish we were, the kernel side
> of things are falling into places.  At least, that's how I see it.
> From now on, I think how to make it actually useable to userland
> deserves a bit more focus, and by "useable to userland", I don't mean
> some group hacking up an elaborate, manual configuration which is
> tailored to the point of being eccentric to suit the needs of the said
> group.  There's nothing wrong with that and they can continue to do
> so, but it just isn't generically useable or useful.  It should be
> possible to generically and automatically split resources among, say,
> several servers and a couple users sharing a system without resorting
> to indecipherable ad-hoc shell script running off rc.local.
>
>
>  Userland efforts
>  
>
> There are currently a few userland efforts trying to make interfacing
> with cgroup le

Re: cgroup: status-quo and userland efforts

2013-04-22 Thread Tim Hockin
On Mon, Apr 22, 2013 at 11:41 PM, Tejun Heo  wrote:
> Hello, Tim.
>
> On Mon, Apr 22, 2013 at 11:26:48PM +0200, Tim Hockin wrote:
>> We absolutely depend on the ability to split cgroup hierarchies.  It
>> pretty much saved our fleet from imploding, in a way that a unified
>> hierarchy just could not do.  A mandated unified hierarchy is madness.
>>  Please step away from the ledge.
>
> You need to be a lot more specific about why unified hierarchy can't
> be implemented.  The last time I asked around blk/memcg people in
> google, while they said that they'll need different levels of
> granularities for different controllers, google's use of cgroup
> doesn't require multiple orthogonal classifications of the same group
> of tasks.

I'll pull some concrete examples together.  I don't have them on hand,
and I am out of country this week.  I have looped in the gang at work
(though some are here with me).

> Also, cgroup isn't dropping multiple hierarchy support over-night.
> What has been working till now will continue to work for very long
> time.  If there is no fundamental conflict with the future changes,
> there should be enough time to migrate gradually as desired.
>
>> More, going towards a unified hierarchy really limits what we can
>> delegate, and that is the word of the day.  We've got a central
>> authority agent running which manages cgroups, and we want out of this
>> business.  At least, we want to be able to grant users a set of
>> constraints, and then let them run wild within those constraints.
>> Forcing all such work to go through a daemon has proven to be very
>> problematic, and it has been great now that users can have DIY
>> sub-cgroups.
>
> Sorry, but that doesn't work properly now.  It gives you the illusion
> of proper delegation but it's inherently dangerous.  If that sort of
> illusion has been / is good enough for your setup, fine.  Delegate at
> your own risks, but cgroup in itself doesn't support delegation to
> lesser security domains and it won't in the foreseeable future.

We've had great success letting users create sub-cgroups in a few
specific controller types (cpu, cpuacct, memory).  This is, of course,
with some restrictions.  We do not just give them blanket access to
all knobs.  We don't need ALL cgroups, just the important ones.

For a simple example, letting users create sub-groups in freezer or
job (we have a job group that we've been carrying) lets them launch
sub-tasks and manage them in a very clean way.

We've been doing a LOT of development internally to make user-defined
sub-memcgs work in our cluster scheduling system, and it's made some
of our biggest, more insane users very happy.

And for some cgroups, like cpuset, hierarchy just doesn't really make
sense to me.  I just don't care if that never works, though I have no
problem with others wanting it. :)   Aside: if the last CPU in your
cpuset goes offline, you should go into a state akin to freezer.
Running on any other CPU is an overt violation of policy that the
user, or worse - the admin, set up.  Just my 2cents.

>> Strong disagreement, here.  We use split hierarchies to great effect.
>> Containment should be composable.  If your users or abstractions can't
>> handle it, please feel free to co-mount the universe, but please
>> PLEASE don't force us to.
>>
>> I'm happy to talk more about what we do and why.
>
> Please do so.  Why do you need multiple orthogonal hierarchies?

Look for this in the next few days/weeks.  From our point of view,
cgroups are the ideal match for how we want to manage things (no
surprise, really, since Mr. Menage worked on both).

Tim


Re: cgroup: status-quo and userland efforts

2013-06-22 Thread Tim Hockin
I'm very sorry I let this fall off my plate.  I was pointed at a
systemd-devel message indicating that this is done.  Is it so?  It
seems so completely ass-backwards to me. Below is one of our use-cases
that I just don't see how we can reproduce in a single hierarchy.
We're also long into the model that users can control their own
sub-cgroups (moderated by permissions decided by admin SW up front).

We have classes of jobs which can run together on shared machines.  This is
VERY important to us, and is a key part of how we run things.  Over the years
we have evolved from very little isolation to fairly strong isolation, and
cgroups are a large part of that.

We have experienced and adapted to a number of problems around isolation over
time.  I won't go into the history of all of these, because it's not so
relevant, but here is how we set things up today.

From a CPU perspective, we have two classes of jobs: production and batch.
Production jobs can (but don't always) ask for exclusive cores, which ensures
that no batch work runs on those CPUs.  We manage this with the cpuset cgroup.
Batch jobs are relegated to the set of CPUs that are "left-over" after
exclusivity rules are applied.  This is implemented with a shared subdirectory
of the cpuset cgroup called "batch".  Production jobs get their own
subdirectories under cpuset.

From an IO perspective we also have two classes of jobs: normal and
DTF-approved.  Normal jobs do not get strong isolation for IO, whereas
DTF-enabled jobs do.  The vast majority of jobs are NOT DTF-enabled, and they
share a nominal amount of IO bandwidth.  This is implemented with a shared
subdirectory of the io cgroup called "default".  Jobs that are DTF-enabled get
their own subdirectories under IO.

This gives us 4 combinations:
  1) { production, DTF }
  2) { production, non-DTF }
  3) { batch, DTF }
  4) { batch non-DTF }

Of these, (3) is sort of nonsense, but the others are actually used
and needed.  This is only possible because of split hierarchies.  In
fact, we undertook a very painful process to move from a unified
cgroup hierarchy to split hierarchies in large part _because of_ these
examples.
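In cgroup-v1 terms, the four combinations above fall out of classifying the same PID independently on two separately mounted hierarchies — which is exactly what a single unified tree cannot express. This sketch shows just the mechanics; group names are illustrative, and "blkio" stands in for the thread's "io" shorthand:

```shell
# classify CPUSET_ROOT IO_ROOT PID CPU_CLASS IO_CLASS -- file the same
# PID independently on the CPU axis and the IO axis. The roots would
# be separately mounted v1 hierarchies, e.g. /sys/fs/cgroup/cpuset
# and /sys/fs/cgroup/blkio.
classify() {
    mkdir -p "$1/$4" "$2/$5"
    echo "$3" > "$1/$4/tasks"
    echo "$3" > "$2/$5/tasks"
}
# Combination 2, { production, non-DTF }:
#   classify /sys/fs/cgroup/cpuset /sys/fs/cgroup/blkio 1234 prod-job-42 default
# Combination 4, { batch, non-DTF }:
#   classify /sys/fs/cgroup/cpuset /sys/fs/cgroup/blkio 5678 batch default
```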

And for more fun, I am simplifying this all. Batch jobs are actually bound to
NUMA-node specific cpuset cgroups when possible.  And we have a similar
concept for the cpu cgroup as for cpuset.  And we have a third tier of IO
jobs.  We don't do all of this for fun - it is in direct response to REAL
problems we have experienced.

Making cgroups composable allows us to build a higher level abstraction that
is very powerful and flexible.  Moving back to unified hierarchies goes
against everything that we're doing here, and will cause us REAL pain.



Re: cgroup: status-quo and userland efforts

2013-06-24 Thread Tim Hockin
On Mon, Jun 24, 2013 at 5:01 PM, Tejun Heo  wrote:
> Hello, Tim.
>
> On Sat, Jun 22, 2013 at 04:13:41PM -0700, Tim Hockin wrote:
>> I'm very sorry I let this fall off my plate.  I was pointed at a
>> systemd-devel message indicating that this is done.  Is it so?  It
>
> It's progressing pretty fast.
>
>> seems so completely ass-backwards to me. Below is one of our use-cases
>> that I just don't see how we can reproduce in a single hierarchy.
>
> Configurations which depend on orthogonal multiple hierarchies of
> course won't be replicated under unified hierarchy.  It's unfortunate
> but those just have to go.  More on this later.

I really want to understand why this is SO IMPORTANT that you have to
break userspace compatibility?  I mean, isn't Linux supposed to be the
OS with the stable kernel interface?  I've seen Linus rant time and
time again about this - why is it OK now?

>> We're also long into the model that users can control their own
>> sub-cgroups (moderated by permissions decided by admin SW up front).
>
> If you're in control of the base system, nothing prevents you from
> doing so.  It's utterly broken from a security and policy-enforcement
> point of view, but if you can trust each piece of software running on
> your system to do the right thing, it's gonna be fine.

Examples?  We obviously don't grant full access, but our kernel gang
and security gang seem to trust the bits we're enabling well enough...

>> This gives us 4 combinations:
>>   1) { production, DTF }
>>   2) { production, non-DTF }
>>   3) { batch, DTF }
>>   4) { batch, non-DTF }
>>
>> Of these, (3) is sort of nonsense, but the others are actually used
>> and needed.  This is only
>> possible because of split hierarchies.  In fact, we undertook a very painful
>> process to move from a unified cgroup hierarchy to split hierarchies in large
>> part _because of_ these examples.
>
> You can create three sibling cgroups and configure cpuset and blkio
> accordingly.  For cpuset, the setup wouldn't make any difference.  For
> blkio, the two non-DTFs would now belong to different cgroups and
> compete with each other as two groups, which won't matter at all as
> non-DTFs are given what's left over after serving DTFs anyway, IIRC.

The non-DTF jobs have a combined share that is small but non-trivial.
If we cut that share in half, giving one slice to prod and one slice
to batch, we get bad sharing under contention.  We tried this.  We
could add control loops in userspace code which try to balance the
shares in proportion to the load.  We did that with CPU, and it's sort
of horrible.  We're moving AWAY from all this craziness in favor of
well-defined hierarchical behaviors.

>> Making cgroups composable allows us to build a higher level abstraction that
>> is very powerful and flexible.  Moving back to unified hierarchies goes
>> against everything that we're doing here, and will cause us REAL pain.
>
> Categorizing processes into hierarchical groups of tasks is a
> fundamental idea and a fundamental idea is something to base things on
> top of as it's something people can agree upon relatively easily and
> establish a structure by.  I'd go as far as saying that it's a
> failure of workload design if workloads in general can't be
> categorized hierarchically.

It's a bit naive to think that this is some absolute truth, don't you
think?  It just isn't so.  You should know better than most what
craziness our users do, and what (legit) rationales they can produce.
I have $large_number of machines running $huge_number of jobs from
thousands of developers running for years upon years backing up my
worldview.

> Even at the practical level, the orthogonal hierarchy encouraged, at
> the very least, the blkcg writeback support which can't be upstreamed
> in any reasonable manner because a resource can't be said to belong
> to a cgroup irrespective of who's looking at it.

I'm not sure I really grok that statement.  I'm OK with defining new
rules that bring some order to the chaos.  Give us new rules to live
by.  All-or-nothing would be fine.  What if mounting cgroupfs gives me
N sub-dirs, one for each compiled-in controller?  You could make THAT
the mount option - you can have either a unified hierarchy of all
controllers or fully disjoint hierarchies.  Or some other rule.

> It's something fundamentally broken and I have very difficult time
> believing google's workload is so different that it can't be
> categorized in a single hierarchy for the purpose of resource
> distribution.  I'm sure there are cases where some compr

Re: cgroup: status-quo and userland efforts

2013-06-26 Thread Tim Hockin
On Wed, Jun 26, 2013 at 2:20 PM, Tejun Heo  wrote:
> Hello, Tim.
>
> On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote:
>> I really want to understand why this is SO IMPORTANT that you have to
>> break userspace compatibility?  I mean, isn't Linux supposed to be the
>> OS with the stable kernel interface?  I've seen Linus rant time and
>> time again about this - why is it OK now?
>
> What the hell are you talking about?  Nobody is breaking userland
> interface.  A new version of interface is being phased in and the old

The first assertion, as I understood, was that (eventually) cgroupfs
will not allow split hierarchies - that unified hierarchy would be the
only mode.  Is that not the case?

The second assertion, as I understood, was that (eventually) cgroupfs
would not support granting access to some cgroup control files to
users (through chown/chmod).  Is that not the case?

> one will stay there for the foreseeable future.  It will be phased out
> eventually but that's gonna take a long time and it will have to be
> something hardly noticeable.  Of course new features will only be
> available with the new interface and there will be efforts to nudge
> people away from the old one but the existing interface will keep
> working as it does.

Hmm, so what exactly is changing then?  If, as you say here, the
existing interfaces will keep working - what is changing?

>> Examples?  we obviously don't grant full access, but our kernel gang
>> and security gang seem to trust the bits we're enabling well enough...
>
> Then the security gang doesn't have any clue what's going on, or at
> least operating on very different assumptions (ie. the workloads are
> trusted by default).  You can OOM the whole kernel by creating many
> cgroups, completely mess up controllers by creating deep hierarchies,
> affect your siblings by adjusting your weight and so on.  It's really
> easy to DoS the whole system if you have write access to a cgroup
> directory.

As I said, it's controlled delegated access.  And we have some patches
that we carry to prevent some of these DoS situations.

>> The non-DTF jobs have a combined share that is small but non-trivial.
>> If we cut that share in half, giving one slice to prod and one slice
>> to batch, we get bad sharing under contention.  We tried this.  We
>
> Why is that tho?  It *should* work fine and I can't think of a reason
> why that would behave particularly badly off the top of my head.
> Maybe I forgot too much of the iosched modification used in google.
> Anyways, if there's a problem, that should be fixable, right?  And
> controller-specific issues like that shouldn't really dictate the
> architectural design too much.

I actually cannot speak to the details of the default IO problem, as
it happened before I really got involved.  But just think through it.
If one half of the split has 5 processes running and the other half
has 200, the processes in the 200 set each get FAR less spindle time
than those in the 5 set.  That is NOT the semantic we need.  We're
trying to offer ~equal access for users of the non-DTF class of jobs.
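The arithmetic behind that is simple; with made-up numbers (10% of disk
time for the non-DTF class, 5 prod processes vs 200 batch processes),
splitting the class into two sibling groups looks like this:

```python
# Why halving the non-DTF share between prod and batch groups breaks
# per-process fairness.  All numbers are illustrative, not measured.

def per_process_share(group_share, nprocs):
    """IO share each process gets when a group's share is divided
    evenly among its members."""
    return group_share / nprocs

# One shared non-DTF group: 10% of disk time across all 205 processes.
shared = per_process_share(0.10, 5 + 200)

# Split into two sibling groups of 5% each: 5 prod procs vs 200 batch.
prod_split  = per_process_share(0.05, 5)
batch_split = per_process_share(0.05, 200)

# Under the split, each prod process gets 40x the spindle time of a
# batch process, instead of the ~equal access the class is meant to get.
print(f"shared: {shared:.6f}  prod: {prod_split:.6f}  batch: {batch_split:.6f}")
print(f"prod/batch ratio under split: {prod_split / batch_split:.0f}x")
```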

This is not the tail doing the wagging.  This is your assertion that
something should work, when it just doesn't.  We have two, totally
orthogonal classes of applications on two totally disjoint sets of
resources.  Conjoining them is the wrong answer.

>> could add control loops in userspace code which try to balance the
>> shares in proportion to the load.  We did that with CPU, and it's sort
>
> Yeah, that is horrible.

Yeah, I would love to explain some of the really nasty things we have
done and are moving away from.  I am not sure I am allowed to, though
:)

>> of horrible.  We're moving AWAY from all this craziness in favor of
>> well-defined hierarchical behaviors.
>
> But I don't follow the conclusion here.  For short term workaround,
> sure, but having that dictate the whole architecture decision seems
> completely backwards to me.

My point is that the orthogonality of resources is intrinsic.  Letting
"it's hard to make it work" dictate the architecture is what's
backwards.

>> It's a bit naive to think that this is some absolute truth, don't you
>> think?  It just isn't so.  You should know better than most what
>> craziness our users do, and what (legit) rationales they can produce.
>> I have $large_number of machines running $huge_number of jobs from
>> thousands of developers running for years upon years backing up my
>> worldview.
>
> If so, you aren't communicating it very well.  I've talked with quite
> a few people about multiple orthogonal hierarchies including people
> inside google.  

Re: cgroup: status-quo and userland efforts

2013-06-26 Thread Tim Hockin
On Wed, Jun 26, 2013 at 6:04 PM, Tejun Heo  wrote:
> Hello,
>
> On Wed, Jun 26, 2013 at 05:06:02PM -0700, Tim Hockin wrote:
>> The first assertion, as I understood, was that (eventually) cgroupfs
>> will not allow split hierarchies - that unified hierarchy would be the
>> only mode.  Is that not the case?
>
> No, unified hierarchy would be an optional thing for quite a while.
>
>> The second assertion, as I understood, was that (eventually) cgroupfs
>> would not support granting access to some cgroup control files to
>> users (through chown/chmod).  Is that not the case?
>
> Again, it'll be an opt-in thing.  The hierarchy controller would be
> able to notice that and issue warnings if it wants to.
>
>> Hmm, so what exactly is changing then?  If, as you say here, the
>> existing interfaces will keep working - what is changing?
>
> New interface is being added and new features will be added only for
> the new interface.  The old one will eventually be deprecated and
> removed, but that *years* away.

OK, then what I don't know is what is the new interface?  A new cgroupfs?

>> As I said, it's controlled delegated access.  And we have some patches
>> that we carry to prevent some of these DoS situations.
>
> I don't know.  You can probably hack around some of the most serious
> problems but the whole thing isn't built for proper delegation and
> that's not the direction the upstream kernel is headed at the moment.
>
>> I actually can not speak to the details of the default IO problem, as
>> it happened before I really got involved.  But just think through it.
>> If one half of the split has 5 processes running and the other half
>> has 200, the processes in the 200 set each get FAR less spindle time
>> than those in the 5 set.  That is NOT the semantic we need.  We're
>> trying to offer ~equal access for users of the non-DTF class of jobs.
>>
>> This is not the tail doing the wagging.  This is your assertion that
>> something should work, when it just doesn't.  We have two, totally
>> orthogonal classes of applications on two totally disjoint sets of
>> resources.  Conjoining them is the wrong answer.
>
> As I've said multiple times, there sure are things that you cannot
> achieve without orthogonal multiple hierarchies, but given the options
> we have at hands, compromising inside a unified hierarchy seems like
> the best trade-off.  Please take a step back from the immediate detail
> and think of the general hierarchical organization of workloads.  If
> DTF / non-DTF is a fundamental part of your workload classification,
> that should go above.

DTF and CPU and cpuset all have "default" groups for some tasks (and
not others) in our world today.  DTF actually has default, prio, and
"normal".  I was simplifying before.  I really wish it were as simple
as you think it is.  But if it were, do you think I'd still be
arguing?

> I don't really understand your example anyway because you can classify
> by DTF / non-DTF first and then just propagate cpuset settings along.
> You won't lose anything that way, right?

This really doesn't scale when I have thousands of jobs running.
Being able to disable at some levels on some controllers probably
helps some, but I can't say for sure without knowing the new interface.

> Again, in general, you might not be able to achieve *exactly* what
> you've been doing, but, an acceptable compromise should be possible
> and not doing so leads to complete mess.

We tried it in unified hierarchy.  We had our Top People on the
problem.  The best we could get was bad enough that we embarked on a
LITERAL 2 year transition to make it better.

>> > But I don't follow the conclusion here.  For short term workaround,
>> > sure, but having that dictate the whole architecture decision seems
>> > completely backwards to me.
>>
>> My point is that the orthogonality of resources is intrinsic.  Letting
>> "it's hard to make it work" dictate the architecture is what's
>> backwards.
>
> No, it's not "it's hard to make it work".  It's more "it's
> fundamentally broken".  You can't identify a resource as belonging
> to a cgroup independent of who's looking at the resource.

What if you could ensure that for a given TID (or PID if required) in
dir X of controller C, all of the other TIDs in that cgroup were in
the same group, but maybe not the same sub-path, under every
controller?  This gives you what it sounds like you wanted elsewhere -
a container abstraction.

In other words, define a container as a set of cgroups, one under
each active cont
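That container abstraction (the email is truncated in the archive; the
class names and rollback strategy below are my assumptions, not the real
implementation) might be sketched as:

```python
# Sketch: a "container" as one cgroup path per active controller, with
# all-or-nothing task joins.  Membership is simulated in memory; real
# code would write the TID into each cgroup's "tasks" file, and since
# there is no truly atomic cross-hierarchy join, it must roll back on
# partial failure.

class JoinError(Exception):
    pass

class Container:
    def __init__(self, cgroups):
        # controller name -> cgroup path under that hierarchy
        self.cgroups = cgroups
        self.members = {c: set() for c in cgroups}

    def _attach(self, controller, tid):
        if tid < 0:                      # stand-in for a failed write
            raise JoinError(f"cannot attach {tid} to {controller}")
        self.members[controller].add(tid)

    def join(self, tid):
        """Join every hierarchy, or none: roll back a partial join."""
        done = []
        try:
            for controller in self.cgroups:
                self._attach(controller, tid)
                done.append(controller)
        except JoinError:
            for controller in done:      # undo what we managed to do
                self.members[controller].discard(tid)
            raise

c1 = Container({"cpu": "/dev/cgroup/cpu/foo",
                "memory": "/dev/cgroup/memory/bar",
                "io": "/dev/cgroup/io/default/foo/bar"})
c1.join(1234)
```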

Re: cgroup: status-quo and userland efforts

2013-06-27 Thread Tim Hockin
On Thu, Jun 27, 2013 at 6:22 AM, Serge Hallyn  wrote:
> Quoting Mike Galbraith (bitbuc...@online.de):
>> On Wed, 2013-06-26 at 14:20 -0700, Tejun Heo wrote:
>> > Hello, Tim.
>> >
>> > On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote:
>> > > I really want to understand why this is SO IMPORTANT that you have to
>> > > break userspace compatibility?  I mean, isn't Linux supposed to be the
>> > > OS with the stable kernel interface?  I've seen Linus rant time and
>> > > time again about this - why is it OK now?
>> >
>> > What the hell are you talking about?  Nobody is breaking userland
>> > interface.  A new version of interface is being phased in and the old
>> > one will stay there for the foreseeable future.  It will be phased out
>> > eventually but that's gonna take a long time and it will have to be
>> > something hardly noticeable.  Of course new features will only be
>> > available with the new interface and there will be efforts to nudge
>> > people away from the old one but the existing interface will keep
>> > working as it does.
>>
>> I can understand some alarm.  When I saw the below I started frothing at
>> the face and howling at the moon, and I don't even use the things much.
>>
>> http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html
>>
>> Hierarchy layout aside, that "private property" bit says that the folks
>> who currently own and use the cgroups interface will lose direct access
>> to it.  I can imagine folks who have become dependent upon on-the-fly
>> management agents of their own design becoming a tad alarmed.
>
> FWIW, the code is too embarrassing yet to see daylight, but I'm playing
> with a very lowlevel cgroup manager which supports nesting itself.
> Access in this POC is low-level ("set freezer.state to THAWED for cgroup
> /c1/c2", "Create /c3"), but the key feature is that it can run in two
> modes - native mode in which it uses cgroupfs, and child mode where it
> talks to a parent manager to make the changes.

In this world, are users able to read cgroup files, or do they have to
go through a central agent, too?

> So then the idea would be that userspace (like libvirt and lxc) would
> talk over /dev/cgroup to its manager.  Userspace inside a container
> (which can't actually mount cgroups itself) would talk to its own
> manager which is talking over a passed-in socket to the host manager,
> which in turn runs natively (uses cgroupfs, and nests "create /c1" under
> the requestor's cgroup).

How do you handle updates of this agent?  Suppose I have hundreds of
running containers, and I want to release a new version of the cgroupd?

(note: inquiries about the implementation do not denote acceptance of
the model :)

> At some point (probably soon) we might want to talk about a standard API
> for these things.  However I think it will have to come in the form of
> a standard library, which knows to either send requests over dbus to
> systemd, or over /dev/cgroup sock to the manager.
>
> -serge
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


cgroup access daemon

2013-06-27 Thread Tim Hockin
Changing the subject, so as not to mix two discussions

On Thu, Jun 27, 2013 at 9:18 AM, Serge Hallyn  wrote:
>
>> > FWIW, the code is too embarrassing yet to see daylight, but I'm playing
>> > with a very lowlevel cgroup manager which supports nesting itself.
>> > Access in this POC is low-level ("set freezer.state to THAWED for cgroup
>> > /c1/c2", "Create /c3"), but the key feature is that it can run in two
>> > modes - native mode in which it uses cgroupfs, and child mode where it
>> > talks to a parent manager to make the changes.
>>
>> In this world, are users able to read cgroup files, or do they have to
>> go through a central agent, too?
>
> The agent won't itself do anything to stop access through cgroupfs, but
> the idea would be that cgroupfs would only be mounted in the agent's
> mntns.  My hope would be that the libcgroup commands (like cgexec,
> cgcreate, etc) would know to talk to the agent when possible, and users
> would use those.

For our use case this is a huge problem.  We have people who access
cgroup files in fairly tight loops, polling for information.  We
have literally hundreds of jobs running at sub-second frequencies -
plumbing all of that through a daemon is going to be a disaster.
Either your daemon becomes a bottleneck, or we have to build something
far more scalable than you really want to.  Not to mention the
inefficiency of inserting a layer.

We also need the ability to set up eventfds for users or to let them
poll() on the socket from this daemon.

>> > So then the idea would be that userspace (like libvirt and lxc) would
>> > talk over /dev/cgroup to its manager.  Userspace inside a container
>> > (which can't actually mount cgroups itself) would talk to its own
>> > manager which is talking over a passed-in socket to the host manager,
>> > which in turn runs natively (uses cgroupfs, and nests "create /c1" under
>> > the requestor's cgroup).
>>
>> How do you handle updates of this agent?  Suppose I have hundreds of
>> running containers, and I want to release a new version of the cgroupd?
>
> This may change (which is part of what I want to investigate with some
> POC), but right now I'm not building any controller-aware smarts into
> it.  I think that's what you're asking about?  The agent doesn't do "slices"
> etc.  This may turn out to be insufficient, we'll see.

No, what I am asking is a release-engineering problem.  Suppose we
need to roll out a new version of this daemon (some new feature or a
bug or something).  We have hundreds of these "child" agents running
in the job containers.

How do I bring down all these children, and then bring them back up on
a new version in a way that does not disrupt user jobs (much)?

Similarly, what happens when one of these child agents crashes?  Does
someone restart it?  Do user jobs just stop working?

>
> So the only state which the agent stores is a list of cgroup mounts (if
> in native mode) or an open socket to the parent (if in child mode), and a
> list of connected children sockets.
>
> HUPping the agent will cause it to reload the cgroupfs mounts (in case
> you've mounted a new controller, living in "the old world" :).  If you
> just kill it and start a new one, it shouldn't matter.
>
>> (note: inquiries about the implementation do not denote acceptance of
>> the model :)
>
> To put it another way, the problem I'm solving (for now) is not the "I
> want a daemon to ensure that requested guarantees are correctly
> implemented." In that sense I'm maintaining the status quo, i.e. the
> admin needs to architect the layout correctly.
>
> The problem I'm solving is really that I want containers to be able to
> handle cgroups even if they can't mount cgroupfs, and I want all
> userspace to be able to behave the same whether they are in a container
> or not.
>
> This isn't meant as a poke in the eye of anyone who wants to address the
> other problem.  If it turns out that we (meaning "the community of
> cgroup users") really want such an agent, then we can add that.  I'm not
> convinced.
>
> What would probably be a better design, then, would be that the agent
> I'm working on can plug into a resource allocation agent.  Or, I
> suppose, the other way around.
>
>> > At some point (probably soon) we might want to talk about a standard API
>> > for these things.  However I think it will have to come in the form of
>> > a standard library, which knows to either send requests over dbus to
>> > systemd, or over /dev/cgroup sock to the manager.
>> >
>> > -serge


Re: cgroup access daemon

2013-06-27 Thread Tim Hockin
On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn  wrote:
> Quoting Tim Hockin (thoc...@hockin.org):
>
>> For our use case this is a huge problem.  We have people who access
>> cgroup files in fairly tight loops, polling for information.  We
>> have literally hundreds of jobs running at sub-second frequencies -
>> plumbing all of that through a daemon is going to be a disaster.
>> Either your daemon becomes a bottleneck, or we have to build something
>> far more scalable than you really want to.  Not to mention the
>> inefficiency of inserting a layer.
>
> Currently you can trivially create a container which has the
> container's cgroups bind-mounted to the expected places
> (/sys/fs/cgroup/$controller) by uncommenting two lines in the
> configuration file, and handle cgroups through cgroupfs there.
> (This is what the management agent wants to be an alternative
> for)  The main deficiency there is that /proc/self/cgroups is
> not filtered, so it will show /lxc/c1 for init's cgroup, while
> the host's /sys/fs/cgroup/devices/lxc/c1/c1.real will be what
> is seen under the container's /sys/fs/cgroup/devices (for
> instance).  Not ideal.

I'm really saying that if your daemon is to provide a replacement for
cgroupfs direct access, it needs to be designed to be scalable.  If
we're going to get away from bind mounting cgroupfs into user
namespaces, then let's try to solve ALL the problems.

>> We also need the ability to set up eventfds for users or to let them
>> poll() on the socket from this daemon.
>
> So you'd want to be able to request updates when any cgroup value
> is changed, right?

Not necessarily ANY, but that's the terminus of this API facet.

> That's currently not in my very limited set of commands, but I can
> certainly add it, and yes it would be a simple unix sock so you can
> set up eventfd, select/poll, etc.

Assuming the protocol is basically a pass-through to basic filesystem
ops, it should be pretty easy.  You just need to add it to your
protocol.
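The select/poll side of such a protocol can be sketched quickly; the
one-line message format here is invented for illustration, and
`socketpair()` stands in for connecting to the daemon's real unix socket:

```python
# Sketch: a client select()ing on a unix socket for cgroup-change
# notifications, instead of polling cgroupfs files directly.
import select
import socket

# socketpair() simulates the daemon<->client connection.
daemon, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Daemon side: push a notification when a watched value changes.
daemon.sendall(b"changed /c1/c2 freezer.state FROZEN\n")

# Client side: block (with a timeout) until there is something to read,
# exactly as an eventfd/poll-based monitor would.
readable, _, _ = select.select([client], [], [], 1.0)
if client in readable:
    event = client.recv(4096).decode()
    print("notification:", event.strip())
```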

But it brings up another point - access control.  How do you decide
which files a child agent should have access to?  Does that ever
change based on the child's configuration? In our world, the answer is
almost certainly yes.

>> >> > So then the idea would be that userspace (like libvirt and lxc) would
>> >> > talk over /dev/cgroup to its manager.  Userspace inside a container
>> >> > (which can't actually mount cgroups itself) would talk to its own
>> >> > manager which is talking over a passed-in socket to the host manager,
>> >> > which in turn runs natively (uses cgroupfs, and nests "create /c1" under
>> >> > the requestor's cgroup).
>> >>
>> >> How do you handle updates of this agent?  Suppose I have hundreds of
>> >> running containers, and I want to release a new version of the cgroupd?
>> >
>> > This may change (which is part of what I want to investigate with some
>> > POC), but right now I'm not building any controller-aware smarts into
>> > it.  I think that's what you're asking about?  The agent doesn't do "slices"
>> > etc.  This may turn out to be insufficient, we'll see.
>>
>> No, what I am asking is a release-engineering problem.  Suppose we
>> need to roll out a new version of this daemon (some new feature or a
>> bug or something).  We have hundreds of these "child" agents running
>> in the job containers.
>
> When I say "container" I mean an lxc container, with its own isolated
> rootfs and mntns.  I'm not sure what your "containers" are, but I if
> they're not that, then they shouldn't need to run a child agent.  They
> can just talk over the host cgroup agent's socket.

If they talk over the host agent's socket, where is the access control
and restriction done?  Who decides how deep I can nest groups?  Who
says which files I may access?  Who stops me from modifying someone
else's container?

Our containers are somewhat thinner and more managed than LXC, but not
that much.  If we're running a system agent in a user container, we
need to manage that software.  We can't just start up a version and
leave it running until the user decides to upgrade - we force
upgrades.

>> How do I bring down all these children, and then bring them back up on
>> a new version in a way that does not disrupt user jobs (much)?
>>
>> Similarly, what happens when one of these child agents crashes?  Does
>> someone restart it?  Do user jobs just stop working?
>
> An upstart^W$init_system job will

Re: cgroup: status-quo and userland efforts

2013-06-27 Thread Tim Hockin
On Thu, Jun 27, 2013 at 10:38 AM, Tejun Heo  wrote:
> Hello, Tim.
>
> On Wed, Jun 26, 2013 at 08:42:21PM -0700, Tim Hockin wrote:
>> OK, then what I don't know is what is the new interface?  A new cgroupfs?
>
> It's gonna be a new mount option for cgroupfs.
>
>> DTF and CPU and cpuset all have "default" groups for some tasks (and
>> not others) in our world today.  DTF actually has default, prio, and
>> "normal".  I was simplifying before.  I really wish it were as simple
>> as you think it is.  But if it were, do you think I'd still be
>> arguing?
>
> How am I supposed to know when you don't communicate it but just wave
> your hands saying it's all very complicated?  The cpuset / blkcg
> example is pretty bad because you can enforce any cpuset rules at the
> leaves.

Modifying hundreds of cgroups is really painful, and yes, we do it
often enough to be able to see it.

>> This really doesn't scale when I have thousands of jobs running.
>> Being able to disable at some levels on some controllers probably
>> helps some, but I can't say for sure without knowing the new interface
>
> How does the number of jobs affect it?  Does each job create a new
> cgroup?

Well, in your model it does...

>> We tried it in unified hierarchy.  We had our Top People on the
>> problem.  The best we could get was bad enough that we embarked on a
>> LITERAL 2 year transition to make it better.
>
> What didn't work?  What part was so bad?  I find it pretty difficult
> to believe that multiple orthogonal hierarchies is the only possible
> solution, so please elaborate the issues that you guys have
> experienced.

I'm looping in more Google people.

> The hierarchy is for organization and enforcement of dynamic
> hierarchical resource distribution and that's it.  If its expressive
> power is lacking, accept compromises or tune the configuration according
> to the workloads.  The latter is necessary in workloads which have
> clear distinction of foreground and background anyway - anything which
> interacts with human beings including androids.

So what you're saying is that you don't care that this new thing is
less capable than the old thing, despite it having real impact.

>> In other words, define a container as a set of cgroups, one under each
>> each active controller type.  A TID enters the container atomically,
>> joining all of the cgroups or none of the cgroups.
>>
>> container C1 = { /cgroup/cpu/foo, /cgroup/memory/bar,
>> /cgroup/io/default/foo/bar, /cgroup/cpuset/
>>
>> This is an abstraction that we maintain in userspace (more or less)
>> and we do actually have headaches from split hierarchies here
>> (handling partial failures, non-atomic joins, etc)
>
> That'd separate out task organization from controller config
> hierarchies.  Kay had a similar idea some time ago.  I think it makes
> things even more complex than it is right now.  I'll continue on this
> below.
>
>> I'm still a bit fuzzy - is all of this written somewhere?
>
> If you dig through cgroup ML, most are there.  There'll be
> "cgroup.controllers" file with which you can enable / disable
> controllers.  Enabling a controller in a cgroup implies that the
> controller is enabled in all ancestors.

Implies or requires?  Cause or predicate?

If controller C is enabled at level X but disabled at level X/Y, does
that mean that X/Y uses the limits set in X?  How about X/Y/Z?
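That ambiguity can be pinned down with a toy model.  This encodes my
reading of the proposed rule (enabling a controller implies it is enabled
in all ancestors, and a disabled child falls under the nearest enabled
ancestor's limits), not the actual design:

```python
# Toy model of the proposed "cgroup.controllers" semantics being asked
# about.  The implication rule and fallback behavior are assumptions.

def ancestors(path):
    """Yield proper ancestors of a cgroup path: "/X/Y" -> "/X", "/"."""
    parts = path.rstrip("/").split("/")
    for i in range(len(parts) - 1, 0, -1):
        yield "/".join(parts[:i]) or "/"

def effective_controller_cgroup(enabled, path, controller):
    """Return the nearest cgroup at or above `path` where `controller`
    is enabled -- i.e. whose limits would apply -- or None."""
    for cg in [path, *ancestors(path)]:
        if controller in enabled.get(cg, set()):
            return cg
    return None

# cpu enabled at /X but not at /X/Y: under this reading, both /X/Y and
# /X/Y/Z fall under the limits configured at /X.
enabled = {"/X": {"cpu"}}
print(effective_controller_cgroup(enabled, "/X/Y", "cpu"))    # /X
print(effective_controller_cgroup(enabled, "/X/Y/Z", "cpu"))  # /X
```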

This will get rid of the bulk of the cpuset scaling problem, but not
all of it.  I think we still have the same problems with cpu as we do
with io.  Perhaps that should have been the example.

>> It sounds like you're missing a layer of abstraction.  Why not add the
>> abstraction you want to expose on top of powerful primitives, instead
>> of dumbing down the primitives?
>
> It sure would be possible to build more and try to address the issues
> we're seeing now; however, after looking at cgroups for some time now,
> the underlying theme is failure to take reasonable trade-offs and
> going for maximum flexibility in making each choice - the choice of
> interface, multiple hierarchies, no restriction on hierarchical
> behavior, splitting threads of the same process into separate cgroups,
> semi-encouraging delegation through file permission without actually
> pondering the consequences and so on.  And each choice probably made
> sense trying to serve each immediate requirement at the time but added
> up it's a giant pile of mess which developed without direction.

I am very sympathetic to this problem.  You could have just described
some of our internal problems too.  The diff

Re: cgroup: status-quo and userland efforts

2013-06-27 Thread Tim Hockin
On Thu, Jun 27, 2013 at 11:14 AM, Serge Hallyn  wrote:
> Quoting Tejun Heo (t...@kernel.org):
>> Hello, Serge.
>>
>> On Thu, Jun 27, 2013 at 08:22:06AM -0500, Serge Hallyn wrote:
>> > At some point (probably soon) we might want to talk about a standard API
>> > for these things.  However I think it will have to come in the form of
>> > a standard library, which knows to either send requests over dbus to
>> > systemd, or over /dev/cgroup sock to the manager.
>>
>> Yeah, eventually, I think we'll have a standardized way to configure
>> resource distribution in the system.  Maybe we'll agree on a
>> standardized dbus protocol or there will be library, I don't know;
>> however, whatever form it may be in, its abstraction level should be
>> way higher than that of direct cgroupfs access.  It's way too low
>> level and very easy to end up in a complete nonsense configuration.
>>
>> e.g. enabling "cpu" on a cgroup while leaving other cgroups alone
>> wouldn't enable fair scheduling on that cgroup but drastically reduce
>> the amount of cpu share it gets, as it now gets treated as a single
>> entity competing with all tasks at the parent level.
>
> Right.  I *think* this can be offered as a daemon which sits as the
> sole consumer of my agent's API, and offers a higher level "do what I
> want" API.  But designing that API is going to be interesting.

This is something we have, partially, and are working to be able to
open-source.  We have a LOT of experience feeding into the semantics
that actually make users happy.

Today it leverages split-hierarchies, but that is not required in the
generic case (only if you want to offer the semantics we do).  It
explicitly delegates some aspects of sub-cgroup control to users, but
that could go away if your lowest-level agency can handle it.

> I should find a good, up-to-date summary of the current behaviors of
> each controller so I can talk more intelligently about it.  (I'll
> start by looking at the kernel Documentation/cgroups, but don't
> feel too confident that they'll be up to date :)
>
>> At the moment, I'm not sure what the eventual abstraction would look
>> like.  systemd is extending its basic constructs by adding slices and
>> scopes and it does make sense to integrate the general organization of
>> the system (services, user sessions, VMs and so on) with resource
>> management.  Given some time, I'm hoping we'll be able to come up with
>> and agree on some common constructs so that each workload can indicate
>> its resource requirements in a unified way.
>>
>> That said, I really think we should experiment for a while before
>> trying to settle down on things.  We've now just started exploring how
>> system-wide resource management can be made widely available to systems
>> without requiring extremely specialized hand-crafted configurations
>> and I'm pretty sure we're getting and gonna get quite a few details
>> wrong, so I don't think it'd be a good idea to try to agree on things
>> right now.  As far as such integration goes, I think it's time to play
>> with things and observe the results.
>
> Right,  I'm not attached to my toy implementation at all - except for
> the ability, in some fashion, to have nested agents which don't have
> cgroupfs access but talk to another agent to get the job done.
>
> -serge
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: cgroup access daemon

2013-06-28 Thread Tim Hockin
On Fri, Jun 28, 2013 at 9:31 AM, Serge Hallyn  wrote:
> Quoting Tim Hockin (thoc...@hockin.org):
>> On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn  
>> wrote:
>> > Quoting Tim Hockin (thoc...@hockin.org):
>> >
>> >> For our use case this is a huge problem.  We have people who access
>> >> cgroup files in fairly tight loops, polling for information.  We
>> >> have literally hundreds of jobs running on sub-second frequencies -
>> >> plumbing all of that through a daemon is going to be a disaster.
>> >> Either your daemon becomes a bottleneck, or we have to build something
>> >> far more scalable than you really want to.  Not to mention the
>> >> inefficiency of inserting a layer.
>> >
>> > Currently you can trivially create a container which has the
>> > container's cgroups bind-mounted to the expected places
>> > (/sys/fs/cgroup/$controller) by uncommenting two lines in the
>> > configuration file, and handle cgroups through cgroupfs there.
>> > (This is what the management agent wants to be an alternative
>> > for)  The main deficiency there is that /proc/self/cgroups is
>> > not filtered, so it will show /lxc/c1 for init's cgroup, while
>> > the host's /sys/fs/cgroup/devices/lxc/c1/c1.real will be what
>> > is seen under the container's /sys/fs/cgroup/devices (for
>> > instance).  Not ideal.
>>
>> I'm really saying that if your daemon is to provide a replacement for
>> cgroupfs direct access, it needs to be designed to be scalable.  If
>> we're going to get away from bind mounting cgroupfs into user
>> namespaces, then let's try to solve ALL the problems.
>>
>> >> We also need the ability to set up eventfds for users or to let them
>> >> poll() on the socket from this daemon.
>> >
>> > So you'd want to be able to request updates when any cgroup value
>> > is changed, right?
>>
>> Not necessarily ANY, but that's the terminus of this API facet.
>>
>> > That's currently not in my very limited set of commands, but I can
>> > certainly add it, and yes it would be a simple unix sock so you can
>> > set up eventfd, select/poll, etc.
>>
>> Assuming the protocol is basically a pass-through to basic filesystem
>> ops, it should be pretty easy.  You just need to add it to your
>> protocol.
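If the daemon's notification channel really is a plain unix socket, as suggested above, the consumer side is just standard select/poll plus recv — a sketch (the message format is invented; the thread never specifies one):

```python
import select

def wait_for_update(sock, timeout=1.0):
    """Block until the cgroup daemon pushes a change notification on
    `sock`, or the timeout expires.  Returns the raw message, or None
    on timeout."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return None
    return sock.recv(4096).decode()
```

The same fd can be registered with epoll or bridged to an eventfd-style loop, so nothing daemon-specific is needed on the client side.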
>>
>> But it brings up another point - access control.  How do you decide
>> which files a child agent should have access to?  Does that ever
>> change based on the child's configuration? In our world, the answer is
>> almost certainly yes.
>
> Could you give examples?
>
> If you have a white/academic paper I should go read, that'd be great.

We don't have anything on this, but examples may help.

Someone running as root should be able to connect to the "native"
daemon and read or write any cgroup file they want, right?  You could
argue that root should be able to do this to a child-daemon, too, but
let's ignore that.

But inside a container, I don't want the users to be able to write to
anything in their own container.  I do want them to be able to make
sub-cgroups, but only 5 levels deep.  For sub-cgroups, they should be
able to write to memory.limit_in_bytes, to read but not write
memory.soft_limit_in_bytes, and not be able to read memory.stat.

To get even fancier, a user should be able to create a sub-cgroup and
then designate that sub-cgroup as "final" - no further sub-sub-cgroups
allowed under it.  They should also be able to designate that a
sub-cgroup is "one-way" - once a process enters it, it can not leave.

These are real(ish) examples based on what people want to do today.
In particular, the last couple are things that we want to do, but
don't do today.

The particular policy can differ per-container.  Production jobs might
be allowed to create sub-cgroups, but batch jobs are not.  Some user
jobs are designated "trusted" in one facet or another and get more
(but still not full) access.
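As a sketch only — the file names come from the examples above, but the policy structure and helper are invented for illustration — a per-container access policy could be as small as a depth limit plus per-file modes:

```python
# Hypothetical per-container policy: depth limit plus per-file access modes.
PRODUCTION_POLICY = {
    "max_depth": 5,                         # sub-cgroups at most 5 levels deep
    "files": {
        "memory.limit_in_bytes": "rw",      # sub-cgroups: read and write
        "memory.soft_limit_in_bytes": "r",  # read but not write
        "memory.stat": "",                  # not even readable
    },
}

def check_access(policy, depth, filename, op):
    """Return True if `op` ("r" or "w") on `filename`, at `depth` levels
    below the container root, is allowed by `policy`."""
    if depth > policy["max_depth"]:
        return False
    return op in policy["files"].get(filename, "")
```

A batch-job policy would simply set `max_depth` to 0; "trusted" jobs would get a wider `files` map.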

> At the moment I'm going on the naive belief that proper hierarchy
> controls will be enforced (eventually) by the kernel - i.e. if
> a task in cgroup /lxc/c1 is not allowed to mknod /dev/sda1, then it
> won't be possible for /lxc/c1/lxc/c2 to take that access.
>
> The native cgroup manager (the one using cgroupfs) will be checking
> the credentials of the requesting child manager for access(2) to
> the cgroup files.

This might be sufficient, or the basis for a sufficient access control
system for users.  The problem comes that we have multiple jobs on a
single machine running as the same user.  We need to ensure that the
jobs can not modify each other.

Re: cgroup: status-quo and userland efforts

2013-06-28 Thread Tim Hockin
On Thu, Jun 27, 2013 at 2:04 PM, Tejun Heo  wrote:
> Hello,
>
> On Thu, Jun 27, 2013 at 01:46:18PM -0700, Tim Hockin wrote:
>> So what you're saying is that you don't care that this new thing is
>> less capable than the old thing, despite it having real impact.
>
> Sort of.  I'm saying, at least up until now, moving away from
> orthogonal hierarchy support seems to be the right trade-off.  It all
> depends on how you measure how much things are simplified and how
> heavy the "real impacts" are.  It's not like these things can be
> determined white and black.  Given the current situation, I think it's
> the right call.

I totally understand where you're coming from - trying to get back to
a stable feature set.  But it sucks to be on the losing end of that
battle - you're cutting things that REALLY matter to us, and without a
really viable alternative.  So we'll keep fighting.

>> If controller C is enabled at level X but disabled at level X/Y, does
>> that mean that X/Y uses the limits set in X?  How about X/Y/Z?
>
> Y and Y/Z wouldn't make any difference.  Tasks belonging to them would
> behave as if they belong to X as far as C is concerned.

OK, that *sounds* sane.  It doesn't solve all our problems, but it
alleviates some of them.
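If that reading is right, the effective cgroup for a controller C is just the nearest ancestor with C enabled — a sketch of the lookup (the enable-map representation here is invented, not the actual kernel interface):

```python
def effective_cgroup(path, enabled):
    """Walk up from `path` to the nearest ancestor where the controller
    is enabled; tasks below a disabled point behave as members of that
    ancestor, per the semantics described above.  `enabled` maps cgroup
    paths to booleans."""
    parts = path.strip("/").split("/")
    while parts:
        candidate = "/" + "/".join(parts)
        if enabled.get(candidate, False):
            return candidate
        parts.pop()
    return "/"  # controller enabled nowhere below root: root's limits apply
```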

>> So take away some of the flexibility that has minimal impact and
>> maximum return.  Splitting threads across cgroups - we use it, but we
>> could get off that.  Force all-or-nothing joining of an aggregate
>
> Please do so.

Splitting threads is sort of important for some cgroups, like CPU.  I
wonder if pjt is paying attention to this thread.

>> construct (a container vs N cgroups).
>>
>> But perform surgery with a scalpel, not a hatchet.
>
> As anything else, it's drawing a line in a continuous spectrum of
> grey.  Right now, given that maintaining multiple orthogonal
> hierarchies while introducing a proper concept of resource container
> involves addition of completely new constructs and complexity, I don't
> think that's a good option.  If there are problems which can't be
> resolved / worked around in a reasonable manner, please bring them up
> along with their contexts.  Let's examine them and see whether there
> are other ways to accommodate them.

You're arguing that the abstraction you want is that of a "container"
but that it's easier to remove options than to actually build a better
API.

I think this is wrong.  Take the opportunity to define the RIGHT
interface that you WANT - a container.  Implement it in terms of
cgroups (and maybe other stuff!).  Make that API so compelling that
people want to use it, and your war of attrition on direct cgroup
madness will be won, but with net progress rather than regress.


Re: [Workman-devel] cgroup: status-quo and userland efforts

2013-06-28 Thread Tim Hockin
On Fri, Jun 28, 2013 at 8:53 AM, Serge Hallyn  wrote:
> Quoting Daniel P. Berrange (berra...@redhat.com):

>> Are you also planning to actually write a new cgroup parent manager
>> daemon too ? Currently my plan for libvirt is to just talk directly
>
> I'm toying with the idea, yes.  (Right now my toy runs in either native
> mode, using cgroupfs, or child mode, talking to a parent manager)  I'd
> love if someone else does it, but it needs to be done.
>
> As I've said elsewhere in the thread, I see 2 problems to be addressed:
>
> 1. The ability to nest the cgroup manager daemons, so that a daemon
> running in a container can talk to a daemon running on the host.  This
> is the problem my current toy is aiming to address.  But the API it
> exports is just a thin layer over cgroupfs.
>
> 2. Abstract away the kernel/cgroupfs details so that userspace can
> explain its cgroup needs generically.  This is IIUC what systemd is
> addressing with slices and scopes.
>
> (2) is where I'd really like to have a well thought out, community
> designed API that everyone can agree on, and it might be worth getting
> together (with Tejun) at plumbers or something to lay something out.

We're also working on (2) (well, we HAVE it, but we're dis-integrating
it so we can hopefully publish more widely).  But our (2) depends on
direct cgroupfs access.  If that is to change, we need a really robust
(1).  It's OK (desirable, in fact) that (1) be a very thin layer of
abstraction.

> In the end, something like libvirt or lxc should not need to care
> what is running underneat it.  It should be able to make its requests
> the same way regardless of whether it running in fedora or ubuntu,
> and whether it is running on the host or in a tightly bound container.
> That's my goal anyway :)
>
>> to systemd's new DBus APIs for all management of cgroups, and then
>> fall back to writing to cgroupfs directly for cases where systemd
>> is not around.  Having a library to abstract these two possible
>> alternatives isn't all that compelling unless we think there will
>> be multiple cgroups manager daemons. I've been somewhat assuming that
>> even Ubuntu will eventually see the benefits & switch to systemd,
>
> So far I've seen no indication of that :)
>
> If the systemd code to manage slices could be made separately
> compileable as a standalone library or daemon, then I'd advocate
> using that.  But I don't see a lot of incentive for systemd to do
> that, so I'd feel like a heel even asking.

I want to say "let the best API win", but I know that systemd is a
giant katamari ball, and it's absorbing subsystems so it may win by
default.  That isn't going to stop us from trying to do what we do,
and share that with the world.

>> then the issue of multiple manager daemons wouldn't really exist.
>
> True.  But I'm running under the assumption that Ubuntu will stick with
> upstart, and therefore yes I'll need a separate (perhaps pair of)
> management daemons.
>
> Even if we were to switch to systemd, I'd like the API for userspace
> programs to configure and use cgroups to be as generic as possible,
> so that anyone who wanted to write their own daemon could do so.
>
> -serge


Re: cgroup: status-quo and userland efforts

2013-06-28 Thread Tim Hockin
On Fri, Jun 28, 2013 at 8:05 AM, Michal Hocko  wrote:
> On Thu 27-06-13 22:01:38, Tejun Heo wrote:

>> Oh, that in itself is not bad.  I mean, if you're root, it's pretty
>> easy to play with and that part is fine.  But combined with the
>> hierarchical nature of cgroup and file permissions, it encourages
>> people to "delegate" subdirectories to less privileged domains,
>
> OK, this really depends on what you expose to non-root users. I have
> seen use cases where admin prepares top-level which is root-only but
> it allows creating sub-groups which are under _full_ control of the
> subdomain. This worked nicely for memcg for example because hard limit,
> oom handling and other knobs are hierarchical so the subdomain cannot
> overwrite what admin has said.

bingo

> And the systemd, with its history of eating projects and not caring much
> about their previous users who are not willing to jump into the systemd
> car, doesn't sound like a good place where to place the new interface to
> me.

+1

If systemd is the only upstream implementation of this single-agent
idea, we will have to invent our own, and continue to diverge rather
than converge.  I think that, if we are going to pursue this model of
a single-agent, we should make a kick-ass implementation that is
flexible and scalable, and full-featured enough to not require
divergence at the lowest layer of the stack.  Then build systemd on
top of that. Let systemd offer more features and policies and
"semantic" APIs.

We will build our own semantic APIs that are, necessarily, different
from systemd.  But we can all use the same low-level mechanism.

Tim


Re: cgroup access daemon

2013-06-28 Thread Tim Hockin
On Fri, Jun 28, 2013 at 12:21 PM, Serge Hallyn  wrote:
> Quoting Tim Hockin (thoc...@hockin.org):
>> On Fri, Jun 28, 2013 at 9:31 AM, Serge Hallyn  
>> wrote:
>> > Quoting Tim Hockin (thoc...@hockin.org):
>> >> On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn  
>> >> wrote:
>> >> > Quoting Tim Hockin (thoc...@hockin.org):
>> > Could you give examples?
>> >
>> > If you have a white/academic paper I should go read, that'd be great.
>>
>> We don't have anything on this, but examples may help.
>>
>> Someone running as root should be able to connect to the "native"
>> daemon and read or write any cgroup file they want, right?  You could
>> argue that root should be able to do this to a child-daemon, too, but
>> let's ignore that.
>>
>> But inside a container, I don't want the users to be able to write to
>> anything in their own container.  I do want them to be able to make
>> sub-cgroups, but only 5 levels deep.  For sub-cgroups, they should be
>> able to write to memory.limit_in_bytes, to read but not write
>> memory.soft_limit_in_bytes, and not be able to read memory.stat.
>>
>> To get even fancier, a user should be able to create a sub-cgroup and
>> then designate that sub-cgroup as "final" - no further sub-sub-cgroups
>> allowed under it.  They should also be able to designate that a
>> sub-cgroup is "one-way" - once a process enters it, it can not leave.
>>
>> These are real(ish) examples based on what people want to do today.
>> In particular, the last couple are things that we want to do, but
>> don't do today.
>>
>> The particular policy can differ per-container.  Production jobs might
>> be allowed to create sub-cgroups, but batch jobs are not.  Some user
>> jobs are designated "trusted" in one facet or another and get more
>> (but still not full) access.
>
> Interesting, thanks.
>
> I'll think a bit on how to best address these.
>
>> > At the moment I'm going on the naive belief that proper hierarchy
>> > controls will be enforced (eventually) by the kernel - i.e. if
>> > a task in cgroup /lxc/c1 is not allowed to mknod /dev/sda1, then it
>> > won't be possible for /lxc/c1/lxc/c2 to take that access.
>> >
>> > The native cgroup manager (the one using cgroupfs) will be checking
>> > the credentials of the requesting child manager for access(2) to
>> > the cgroup files.
>>
>> This might be sufficient, or the basis for a sufficient access control
>> system for users.  The problem comes that we have multiple jobs on a
>> single machine running as the same user.  We need to ensure that the
>> jobs can not modify each other.
>
> Would running them each in user namespaces with different mappings (all
> jobs running as uid 1000, but uid 1000 mapped to different host uids
> for each job) be (long-term) feasible?

Possibly.  It's a largish imposition to make on the caller (we don't
use user namespaces today, though we are evaluating how to start using
them) but perhaps not terrible.
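Serge's suggestion gives each job its own host-uid range while every job sees itself as uid 1000; the translation follows the /proc/PID/uid_map format of (inside, outside, count) ranges — a sketch of the lookup:

```python
def map_uid(uid_map, container_uid):
    """Translate a container uid to a host uid, given uid_map entries of
    the form (inside_uid, outside_uid, count), as in /proc/PID/uid_map."""
    for inside, outside, count in uid_map:
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    return None  # unmapped: shows up as the overflow (nobody) uid on the host
```

With job A mapped via [(1000, 100000, 1)] and job B via [(1000, 200000, 1)], both run as uid 1000 inside their containers but own different host uids, so file permissions keep their cgroup files out of each other's reach.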

>> > It is a named socket.
>>
>> So anyone can connect?  even with SO_PEERCRED, how do you know which
>> branches of the cgroup tree I am allowed to modify if the same user
>> owns more than one?
>
> I was assuming that any process requesting management of
> /c1/c2/c3 would have to be in one of its ancestor cgroups (i.e. /c1)
>
> So if you have two jobs running as uid 1000, one under /c1 and one
> under /c2, and one as uid 1001 running under /c3 (with the uids owning
> the cgroups), then the file permissions will prevent 1000 and 1001
> from walk over each other, while the cgroup manager will not allow
> a process (child manager or otherwise) under /c1 to manage cgroups
> under /c2 and vice versa.
>
>> >> Do you have a design spec, or a requirements list, or even a prototype
>> >> that we can look at?
>> >
>> > The readme at https://github.com/hallyn/cgroup-mgr/blob/master/README
>> > shows what I have in mind.  It (and the sloppy code next to it)
>> > represent a few hours' work over the last few days while waiting
>> > for compiles and in between emails...
>>
>> Awesome.  Do you mind if we look?
>
> No, but it might not be worth it (other than the readme) :) - so far
> it's only served to help me think through what I want and need from
> the mgr.
>
>> > But again, it is completely predicated on my goal to have libvirt
>> > and lxc (and other cgroup users) be able to use the 

Re: cgroup: status-quo and userland efforts

2013-06-28 Thread Tim Hockin
Come on, now, Lennart.  You put a lot of words in my mouth.

On Fri, Jun 28, 2013 at 6:48 PM, Lennart Poettering  wrote:
> On 28.06.2013 20:53, Tim Hockin wrote:
>
>> a single-agent, we should make a kick-ass implementation that is
>> flexible and scalable, and full-featured enough to not require
>> divergence at the lowest layer of the stack.  Then build systemd on
>> top of that. Let systemd offer more features and policies and
>> "semantic" APIs.
>
>
> Well, what if systemd is already kick-ass? I mean, if you have a problem
> with systemd, then that's your own problem, but I really don't see why I
> should bother?

I didn't say it wasn't.  I said that we can build a common substrate
that systemd can build on *and* non-systemd systems can use *and*
Google can participate in.

> I for sure am not going to make the PID 1 a client of another daemon. That's
> just wrong. If you have a daemon that is both conceptually the manager of
> another service and the client of that other service, then that's bad design
> and you will easily run into deadlocks and such. Just think about it: if you
> have some external daemon for managing cgroups, and you need cgroups for
> running external daemons, how are you going to start the external daemon for
> managing cgroups? Sure, you can hack around this, make that daemon special,
> and magic, and stuff -- or you can just not do such nonsense. There's no
> reason to repeat the fuckup that cgroup became in kernelspace a second time,
> but this time in userspace, with multiple manager daemons all with different
> and slightly incompatible definitions of what a unit to manage actually is...

I forgot about the tautology of systemd.  systemd is monolithic.
Therefore it can not have any external dependencies.  Therefore it
must absorb anything it depends on.  Therefore systemd continues to
grow in size and scope.  Up next: systemd manages your X sessions!

But that's not my point.  It seems pretty easy to make this cgroup
management (in "native mode") a library that can have either a thin
veneer of a main() function, while also being usable by systemd.  The
point is to solve all of the problems ONCE.  I'm trying to make the
case that systemd itself should be focusing on features and policies
and awesome APIs.

> We want to run fewer, simpler things on our systems, we want to reuse as

Fewer and simpler are not compatible, unless you are losing
functionality.  Systemd is fewer, but NOT simpler.

> much of the code as we can. You don't achieve that by running yet another
> daemon that does worse what systemd can anyway do simpler, easier and
> better.

Considering this is all hypothetical, I find this to be a funny
debate.  My hypothetical idea is better than your hypothetical idea.

> The least you could grant us is to have a look at the final APIs we will
> have to offer before you already imply that systemd cannot be a valid
> implementation of any API people could ever agree on.

Whoah, don't get defensive.  I said nothing of the sort.  The fact of
the matter is that we do not run systemd, at least in part because of
the monolithic nature.  That's unlikely to change in this timescale.
What I said was that it would be a shame if we had to invent our own
low-level cgroup daemon just because the "upstream" daemons was too
tightly coupled with systemd.

I think we have a lot of experience to offer to this project, and a
vested interest in seeing it done well.  But if it is purely
targetting systemd, we have little incentive to devote resources to
it.

Please note that I am strictly talking about the lowest layer of the
API.  Just the thing that guards cgroupfs against mere mortals.  The
higher layers - where abstractions exist, that are actually USEFUL to
end users - are not really in scope right now.  We already have our
own higher level APIs.

This is supposed to be collaborative, not combative.

Tim


Re: cgroup: status-quo and userland efforts

2013-06-30 Thread Tim Hockin
On Sun, Jun 30, 2013 at 12:39 PM, Lennart Poettering
 wrote:
> Heya,
>
>
> On 29.06.2013 05:05, Tim Hockin wrote:
>>
>> Come on, now, Lennart.  You put a lot of words in my mouth.
>
>
>>> I for sure am not going to make the PID 1 a client of another daemon.
>>> That's
>>> just wrong. If you have a daemon that is both conceptually the manager of
>>> another service and the client of that other service, then that's bad
>>> design
>>> and you will easily run into deadlocks and such. Just think about it: if
>>> you
>>> have some external daemon for managing cgroups, and you need cgroups for
>>> running external daemons, how are you going to start the external daemon
>>> for
>>> managing cgroups? Sure, you can hack around this, make that daemon
>>> special,
>>> and magic, and stuff -- or you can just not do such nonsense. There's no
>>> reason to repeat the fuckup that cgroup became in kernelspace a second
>>> time,
>>> but this time in userspace, with multiple manager daemons all with
>>> different
>>> and slightly incompatible definitions of what a unit to manage actually is...
>>
>>
>> I forgot about the tautology of systemd.  systemd is monolithic.
>
>
> systemd is certainly not monolithic for almost any definition of that term.
> I am not sure where you are taking that from, and I am not sure I want to
> discuss on that level. This just sounds like FUD you picked up somewhere and
> are repeating carelessly...

It does a number of sort-of-related things.  Maybe it does them better
by doing them together.  I can't say, really.  We don't use it at
work, and I am on Ubuntu elsewhere, for now.

>> But that's not my point.  It seems pretty easy to make this cgroup
>> management (in "native mode") a library that can have either a thin
>> veneer of a main() function, while also being usable by systemd.  The
>> point is to solve all of the problems ONCE.  I'm trying to make the
>> case that systemd itself should be focusing on features and policies
>> and awesome APIs.
>
> You know, getting this all right isn't easy. If you want to do things
> properly, then you need to propagate attribute changes between the units you
> manage. You also need something like a scheduler, since a number of
> controllers can only be configured under certain external conditions (for
> example: the blkio or devices controller use major/minor parameters for
> configuring per-device limits. Since major/minor assignments are pretty much
> unpredictable these days -- and users probably want to configure things with
> friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to
> wait for devices to show up before we can configure the parameters.) Soo...
> you need a graph of units, where you can propagate things, and schedule
> things based on some execution/event queue. And the propagation and
> scheduling are closely intermingled.

I'm really just talking about the most basic low-level substrate of
writing to cgroupfs.  Again, we don't use udev (yet?) so we don't have
these problems.  It seems to me that it's possible to formulate a
bottom layer that is usable by both systemd and non-systemd systems.
But, you know, maybe I am wrong and our internal universe is so much
simpler (and behind the times) than the rest of the world that
layering can work for us and not you.
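Lennart's blkio/devices example above boils down to resolving a stable symlink such as /dev/disk/by-id/* to the major:minor pair the controller files expect, once the device has shown up — a sketch (the cgroup file names in the note below are from memory and illustrative):

```python
import os
import stat

def dev_majmin(path):
    """Resolve `path` (possibly a stable /dev/disk/by-id/* symlink) to the
    "major:minor" string that blkio/devices cgroup files take."""
    st = os.stat(os.path.realpath(path))
    if not (stat.S_ISBLK(st.st_mode) or stat.S_ISCHR(st.st_mode)):
        raise ValueError(f"{path} is not a device node")
    return f"{os.major(st.st_rdev)}:{os.minor(st.st_rdev)}"
```

A manager would then write, e.g., `f"{dev_majmin(link)} 1048576"` into something like blkio.throttle.read_bps_device — the scheduling problem Lennart describes is purely about *when* this resolution can run, not the resolution itself.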

> Now, that's pretty much exactly what systemd actually *is*. It implements a
> graph of units with a scheduler. And if you rip that part out of systemd to
> make this an "easy cgroup management library", then you simply turn what
> systemd is into a library without leaving anything. Which is just bogus.
>
> So no, if you say "seems pretty easy to make this cgroup management a
> library" then well, I have to disagree with you.
>
>
>>> We want to run fewer, simpler things on our systems, we want to reuse as
>>
>>
>> Fewer and simpler are not compatible, unless you are losing
>> functionality.  Systemd is fewer, but NOT simpler.
>
>
> Oh, certainly it is. If we'd split up the cgroup fs access into separate
> daemon of some kind, then we'd need some kind of IPC for that, and so you
> have more daemons and you have some complex IPC between the processes. So
> yeah, the systemd approach is certainly both simpler and uses fewer daemons
> than your hypothetical one.

Well, it SOUNDS like Serge is trying to develop this to demonstrate
that a standalone daemon works.  That's what I am keen to help with
(or else we have to invent ourselves).  I am not really afraid of IPC
or of "m

Re: [PATCH 00/10] cgroups: Task counter subsystem v8

2013-04-01 Thread Tim Hockin
A year later - what ever happened with this?  I want it more than ever
for Google's use.

On Tue, Jan 31, 2012 at 7:37 PM, Frederic Weisbecker  wrote:
> Hi,
>
> Changes In this version:
>
> - Split 32/64 bits version of res_counter_write_u64() [1/10]
>   Courtesy of Kirill A. Shutemov
>
> - Added Kirill's ack [8/10]
>
> - Added selftests [9/10], [10/10]
>
> Please consider for merging. At least two users want this feature:
> https://lkml.org/lkml/2011/12/13/309
> https://lkml.org/lkml/2011/12/13/364
>
> More general details provided in the last version posting:
> https://lkml.org/lkml/2012/1/13/230
>
> Thanks!
>
>
> Frederic Weisbecker (9):
>   cgroups: add res_counter_write_u64() API
>   cgroups: new resource counter inheritance API
>   cgroups: ability to stop res charge propagation on bounded ancestor
>   res_counter: allow charge failure pointer to be null
>   cgroups: pull up res counter charge failure interpretation to caller
>   cgroups: allow subsystems to cancel a fork
>   cgroups: Add a task counter subsystem
>   selftests: Enter each directories before executing selftests
>   selftests: Add a new task counter selftest
>
> Kirill A. Shutemov (1):
>   cgroups: add res counter common ancestor searching
>
>  Documentation/cgroups/resource_counter.txt |   20 ++-
>  Documentation/cgroups/task_counter.txt |  153 +++
>  include/linux/cgroup.h |   20 +-
>  include/linux/cgroup_subsys.h  |5 +
>  include/linux/res_counter.h|   27 ++-
>  init/Kconfig   |9 +
>  kernel/Makefile|1 +
>  kernel/cgroup.c|   23 ++-
>  kernel/cgroup_freezer.c|6 +-
>  kernel/cgroup_task_counter.c   |  272 
> 
>  kernel/exit.c  |2 +-
>  kernel/fork.c  |7 +-
>  kernel/res_counter.c   |  103 +++-
>  tools/testing/selftests/Makefile   |2 +-
>  tools/testing/selftests/run_tests  |6 +-
>  tools/testing/selftests/task_counter/Makefile  |8 +
>  tools/testing/selftests/task_counter/fork.c|   40 +++
>  tools/testing/selftests/task_counter/forkbomb.c|   40 +++
>  tools/testing/selftests/task_counter/multithread.c |   68 +
>  tools/testing/selftests/task_counter/run_test  |  198 ++
>  .../selftests/task_counter/spread_thread_group.c   |   82 ++
>  21 files changed, 1056 insertions(+), 36 deletions(-)
>  create mode 100644 Documentation/cgroups/task_counter.txt
>  create mode 100644 kernel/cgroup_task_counter.c
>  create mode 100644 tools/testing/selftests/task_counter/Makefile
>  create mode 100644 tools/testing/selftests/task_counter/fork.c
>  create mode 100644 tools/testing/selftests/task_counter/forkbomb.c
>  create mode 100644 tools/testing/selftests/task_counter/multithread.c
>  create mode 100755 tools/testing/selftests/task_counter/run_test
>  create mode 100644 tools/testing/selftests/task_counter/spread_thread_group.c
>
> --
> 1.7.5.4
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/10] cgroups: Task counter subsystem v8

2013-04-01 Thread Tim Hockin
On Mon, Apr 1, 2013 at 11:46 AM, Tejun Heo  wrote:
> On Mon, Apr 01, 2013 at 11:43:03AM -0700, Tim Hockin wrote:
>> A year later - what ever happened with this?  I want it more than ever
>> for Google's use.
>
> I think the conclusion was "use kmemcg instead".

Pardon my ignorance, but... what?  Use kernel memory limits as a proxy
for process/thread counts?  That sounds terrible - I hope I am
misunderstanding?  This task counter patch had several properties that
mapped very well to what we want.

Is it dead in the water?


Re: [PATCH 00/10] cgroups: Task counter subsystem v8

2013-04-01 Thread Tim Hockin
On Mon, Apr 1, 2013 at 1:29 PM, Tejun Heo  wrote:
> On Mon, Apr 01, 2013 at 01:09:09PM -0700, Tim Hockin wrote:
>> Pardon my ignorance, but... what?  Use kernel memory limits as a proxy
>> for process/thread counts?  That sounds terrible - I hope I am
>
> Well, the argument was that process / thread counts were a poor and
> unnecessary proxy for kernel memory consumption limit.  IIRC, Johannes
> put it as (I'm paraphrasing) "you can't go to Fry's and buy 4k thread
> worth of component".
>
>> misunderstanding?  This task counter patch had several properties that
>> mapped very well to what we want.
>>
>> Is it dead in the water?
>
> After some discussion, Frederic agreed that at least his use case can
> be served well by kmemcg, maybe even better - IIRC it was container
> fork bomb scenario, so you'll have to argue your way in why kmemcg
> isn't a suitable solution for your use case if you wanna revive this.

We run dozens of jobs from dozens users on a single machine.  We
regularly experience users who leak threads, running into the tens of
thousands.  We are unable to raise the PID_MAX significantly due to
some bad, but really thoroughly baked-in decisions that were made a
long time ago.  What we experience on a daily basis is users
complaining about getting a "pthread_create(): resource unavailable"
error because someone on the machine has leaked.

Today we use RLIMIT_NPROC to lock most users down to a smaller max.
But this is a per-user setting, not a per-container setting, and users
do not control where their jobs land.  Scheduling decisions often put
multiple thread-heavy but non-leaking jobs from one user onto the same
machine, which again causes problems.  Further, it does not help for
some of our use cases where a logical job can run as multiple UIDs for
different processes within.

From the end-user point of view this is an isolation leak which is
totally non-deterministic for them.  They can not know what to plan
for.  Getting cgroup-level control of this limit is important for a
saner SLA for our users.

In addition, the behavior around locking-out new tasks seems like a
nice way to simplify and clean up end-life work for the administrative
system.  Admittedly, we can mostly work around this with freezer
instead.

What I really don't understand is why so much push back?  We have this
nicely structured cgroup system.  Each cgroup controller's code is
pretty well partitioned - why would we not want more complete
functionality built around it?  We accept device drivers for the most
random, useless crap on the assertion that "if you don't need it,
don't compile it in".  I can think of a half dozen more really useful,
cool things we can do with cgroups, but I know the pushback will be
tremendous, and I just don't grok why.

Tim


Re: [PATCH 00/10] cgroups: Task counter subsystem v8

2013-04-01 Thread Tim Hockin
On Mon, Apr 1, 2013 at 3:03 PM, Tejun Heo  wrote:
> Hello, Tim.
>
> On Mon, Apr 01, 2013 at 02:02:06PM -0700, Tim Hockin wrote:
>> We run dozens of jobs from dozens users on a single machine.  We
>> regularly experience users who leak threads, running into the tens of
>> thousands.  We are unable to raise the PID_MAX significantly due to
>> some bad, but really thoroughly baked-in decisions that were made a
>> long time ago.  What we experience on a daily basis is users
>
> U so that's why you guys can't use kernel memory limit? :(

Because it is completely non-obvious how to map between the two in a
way that is safe across kernel versions and not likely to blow up in
our faces.  It's a hack, in other words.

>> complaining about getting a "pthread_create(): resource unavailable"
>> error because someone on the machine has leaked.
> ...
>> What I really don't understand is why so much push back?  We have this
>> nicely structured cgroup system.  Each cgroup controller's code is
>> pretty well partitioned - why would we not want more complete
>> functionality built around it?  We accept device drivers for the most
>> random, useless crap on the assertion that "if you don't need it,
>> don't compile it in".  I can think of a half dozen more really useful,
>> cool things we can do with cgroups, but I know the pushback will be
>> tremendous, and I just don't grok why.
>
> In general, because it adds to maintenance overhead.  e.g. We've been
> trying to make all cgroups follow consistent nesting rules.  We're now
> almost there with a couple controllers left.  This one would have been
> another thing to do, which is fine if it's necessary but if it isn't
> we're just adding up work for no good reason.
>
> More importantly, because cgroup is already plagued with so many bad
> design decisions - some from core design decisions - e.g. not being
> able to actually identify a resource outside of a context of a task.
> Others are added on by each controller going out doing whatever it
> wants without thinking about how the whole thing would come together
> afterwards - e.g. double accounting between cpu and cpuacct,
> completely illogical and unusable hierarchy implementations in
> anything other than cpu controllers (they're getting better), and so
> on.  Right now it's in a state where there's not many things coherent
> about it.  Sure, every controller and feature supports the ones their
> makers intended them to but when collected together it's just a mess,
> which is one of the common complaints against cgroup.
>
> So, no free-for-all, please.
>
> Thanks.
>
> --
> tejun


Re: [PATCH 00/10] cgroups: Task counter subsystem v8

2013-04-01 Thread Tim Hockin
On Mon, Apr 1, 2013 at 3:35 PM, Tejun Heo  wrote:
> Hey,
>
> On Mon, Apr 01, 2013 at 03:20:47PM -0700, Tim Hockin wrote:
>> > U so that's why you guys can't use kernel memory limit? :(
>>
>> Because it is completely non-obvious how to map between the two in a
>> way that is safe across kernel versions and not likely to blow up in
>> our faces.  It's a hack, in other words.
>
> Now we're repeating the argument Frederic and Johannes had, so I'd
> suggest going back the thread and reading the discussion and if you
> still think using kmemcg is a bad idea, please explain why that is so.
> For the specific point that you just raised, the scale tilted toward
> thread/process count is a hacky and unreliable representation of
> kernel memory resource than the other way around, at least back then.

I am not limited by kernel memory, I am limited by PIDs, and I need to
be able to manage them.  memory.kmem.usage_in_bytes seems to be far
too noisy to be useful for this purpose.  It may work fine for "just
stop a fork bomb" but not for any sort of finer-grained control.

> If you think you can tilt it the other way, please feel free to try.

Just because others caved, doesn't make it less of a hack.  And I will
cave, too, because I don't have time to bang my head against a wall,
especially when I can see the remnants of other people who have tried.

We'll work around it, or we'll hack around it, or we'll carry this
patch in our own tree and just grumble about ridiculous hacks every
time we have to forward port it.

I was just hoping that things had worked themselves out in the last year.


Re: [PATCH 00/10] cgroups: Task counter subsystem v8

2013-04-01 Thread Tim Hockin
On Mon, Apr 1, 2013 at 4:18 PM, Tejun Heo  wrote:
> Hello,
>
> On Mon, Apr 01, 2013 at 03:57:46PM -0700, Tim Hockin wrote:
>> I am not limited by kernel memory, I am limited by PIDs, and I need to
>> be able to manage them.  memory.kmem.usage_in_bytes seems to be far
>> too noisy to be useful for this purpose.  It may work fine for "just
>> stop a fork bomb" but not for any sort of finer-grained control.
>
> So, why are you limited by PIDs other than the arcane / weird
> limitation that you have whereever that limitation is?

Does anyone anywhere actually set PID_MAX > 64K?  As far as I can
tell, distros default it to 32K or 64K because there's a lot of stuff
out there that assumes this to be true.  This is the problem we have -
deep down in the bowels of code that is taking literally years to
overhaul, we have identified a bad assumption that PIDs are always 5
characters long.  I can't fix it any faster.  That said, we also
identified other software that make similar assumptions, though they
are less critical to us.

>> > If you think you can tilt it the other way, please feel free to try.
>>
>> Just because others caved, doesn't make it less of a hack.  And I will
>> cave, too, because I don't have time to bang my head against a wall,
>> especially when I can see the remnants of other people who have tried.
>>
>> We'll work around it, or we'll hack around it, or we'll carry this
>> patch in our own tree and just grumble about ridiculous hacks every
>> time we have to forward port it.
>>
>> I was just hoping that things had worked themselves out in the last year.
>
> It's kinda weird getting this response, as I don't think it has been
> particularly walley.  The arguments were pretty sound from what I
> recall and Frederic's use case was actually better covered by kmemcg,
> so where's the said wall?  And I asked you why your use case is
> different and the only reason you gave me is some arbitrary PID
> limitation on whatever thing you're using, which you gotta agree is a
> pretty hard sell.  So, if you think you have a valid case, please just
> explain it.  Why go passive agressive on it?  If you don't have a
> valid case for pushing it, yes, you'll have to hack around it - carry
> the patches in your tree, whatever, or better, fix the weird PID
> problem.

Sorry Tejun, you're being very reasonable, I was not.  The history of
this patch is what makes me frustrated.  It seems like such an obvious
thing to support that it blows my mind that people argue it.

You know our environment.  Users can use their memory budgets however
they like - kernel or userspace.  We have good accounting, but we are
PID limited.  We've even implemented some hacks of our own to make
that hurt less because the previously-mentioned assumptions are just
NOT going away any time soon.  I literally have user bugs every week
on this.  Hopefully the hacks we have put in place will make the users
stop hurting.  But we're left with some residual problems, some of
which are because the only limits we can apply are per-user rather
than per-container.

From our POV building the cluster, cgroups are strictly superior to
most other control interfaces because they work at the same
granularity that we do.  I want more things to support cgroup control.
This particular one was double-tasty because the ability to set the
limit to 0 would actually solve a different problem we have in
teardown.  But as I said, we can mostly work around that.

So I am frustrated because I don't think my use case will convince you
(at the root of it, it is a problem of our own making, but it LONG
predates me), despite my belief that it is obviously a good feature.
I find myself hoping that someone else comes along and says "me too"
rather than using a totally different hack for this.

Oh well.  Thanks for the update.  Off to do our own thing again.


> --
> tejun


66MHz PCI

2001-04-03 Thread Tim Hockin

All,

is it possible to detect whether a device is running at 66MHz (as opposed
to 33)?  PCI defines a 66MHz capable bit, but not a 66MHz enabled bit.  We
have a silly device that seems to need to know what it's bus speed is, but
have no way to tell from software (that I know of).

So, pray tell -- is there a way to figure it out?



Re: [PATCH] Process pinning

2001-04-17 Thread Tim Hockin

> disallowed CPU on which it is already running.  And even a non-RT
> process will stick on its disallowed CPU as long as nothing else runs
> there.

are we going to keep the cpus_allowed API?  If we want the (IMHO) more
flexible sysmp() API - I'll finish the 2.4 port.  If we are going to keep
cpus_allowed - I'll just abandon pset and sysmp.

Personally, I like sysmp() and the pset tools better, perhaps with a /proc
extension to it.





Re: Fix for Donald Becker's DP83815 network driver (v1.07)

2001-04-18 Thread Tim Hockin

> > use vanilla 2.4.x, you can simply copy drivers/net/starfire.c from the -ac
> > tree.
> 
> I can't use 2.4 kernels ATM because they don't boot (at all) on Cobalt
> hardware for some reason - when I've got chance I'll look into it and try
> and fix the 2.4 kernels so they work on Cobalt kit, but ATM it's fairly
> low on my todo list ...

ftp://ftp.cobaltnet.com/pub/users/thockin/2.4



Re: SMP: bind process to cpu

2001-02-17 Thread Tim Hockin

> Is it possible to bind a process to a specific
> cpu on this SMP machine (process affinity) ?
> 
> I there something like pset ?

http://isunix.it.ilstu.edu/~thockin/pset  - pset for linux-2.2 (not ported
to 2.4 yet)



mtrr message

2001-02-24 Thread Tim Hockin

I'm noticing these messages:

mtrr: base(0xd400) is not aligned on a size(0x180) boundary

many times in dmesg.  System is a dual P3-933 on an MSI 694D board (Apollo
Pro 133).

Is it worrisome?



Re: IDE not fully found (2.4.2) PDC20265

2001-02-25 Thread Tim Hockin

> On Mon, 15 Jan 2001, Tim Hockin wrote:
> 
> > Motherboard (MSI 694D-AR) has Via Apollo Pro chipset, those IDE drives seem
> > fine.  Board also has a promise PDC20265  RAID/ATA100 controller.  On each
> > channel of this controller I have an IBM 45 GB ATA100 drive as master.
> > (hde and hdg).  BIOS sees these drives fine.  Linux only see hde and never
> > hdg (ide[012] but not ide3).  I thought I'd post it here, in case anyone

So I have a clue - pci-ide.c is looking at a PCI register to determine if
ide channels are enabled.  It seems that the BIOS on this board is not
enabling the second channel of the promise controller in this register.
There are other "enabled" bits, apparently.  pdc202xx.c checks some IO
registers from one of the base addresses to determine status.

Unfortunately, the enabled bit of this register seems to be
write-protected.  There must be an unlock bit in another register.  Anyone
have a datasheet for a pdc202x?  How do I unprotect PCI reg 0x50?

If I bypass the test for the enabled bit in ide-pci.c, I get all my drives
properly.  

Tim



List down or am I unsubscribed?

2001-01-09 Thread Tim Hockin

I haven't received messages in a few days at least...




IDE not fully found (2.4.0)

2001-01-15 Thread Tim Hockin

Just built a new system with Linux-2.4.0.

Motherboard (MSI 694D-AR) has Via Apollo Pro chipset, those IDE drives seem
fine.  Board also has a promise PDC20265  RAID/ATA100 controller.  On each
channel of this controller I have an IBM 45 GB ATA100 drive as master.
(hde and hdg?).  BIOS sees these drives fine.  Linux only see hde and never
hdg (ide[012] but not ide3).  I thought I'd post it here, in case anyone
else knew the answer right away.  

Second question:  Does the RAID functionality of this device work under
linux?  If so, is it better than LVM or MD?

boot snippet:
---
VP_IDE: IDE controller on PCI bus 00 dev 39
VP_IDE: chipset revision 16
VP_IDE: not 100% native mode: will probe irqs later
VP_IDE: VIA vt82c686a IDE UDMA66 controller on pci0:7.1
ide0: BM-DMA at 0xc000-0xc007, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0xc008-0xc00f, BIOS settings: hdc:DMA, hdd:pio
PDC20265: IDE controller on PCI bus 00 dev 60
PDC20265: chipset revision 2
PDC20265: not 100% native mode: will probe irqs later
PDC20265: (U)DMA Burst Bit ENABLED Primary MASTER Mode Secondary MASTER
Mode.
ide2: BM-DMA at 0xdc00-0xdc07, BIOS settings: hde:pio, hdf:pio



Re: int. assignment on SMP + ServerWorks chipset

2001-01-15 Thread Tim Hockin

> And if anybody else understands pirq routing, speak up. It's a black art.
> 

I have some experience with PIRQ and Serverworks, but I missed the first
bit of this discussion - can someone catch me up?



Re: [PATCH] PCI-Devices and ServerWorks chipset

2001-01-22 Thread Tim Hockin

> > patch is wrong -- it doesn't make any sense to scan a bus _range_. The registers
> > 0x44 and 0x45 are probably ID's of two primary buses and the code should scan
> > both of them, but not the space between them.
> 

0x44 is the primary bus number of the host bridge, and 0x45 is the
subordinate bus number for the bridge.  Just like a PCI-PCI bridge, but
different :)  Since there are two CNB30 functions, each has unique values
for this.  The primary bus of the second bridge must be the subordinate bus
of the first bridge + 1.  PRIMARY(1) = SUBORDINATE(0) + 1;

Tim



Re: Can EINTR be handled the way BSD handles it? -- a plea from a

2000-11-06 Thread Tim Hockin

>   "THIS CHANGE CAUSED PROBLEMS FOR SOME APPLICATION CODE."
> 
>   _Which_ applications?
> 
>   _Why_ did they have a problem?  Was this due to a bug or were they
>   designed to do stuff this way?

I, for one, have an app that relies on syscalls not being restarted:
app goes into a blocking read on a socket
signal interrupts blocking read
signal handler sets a global flag
read returns interrupted
flag is checked - action may be taken

Now, this could be redone to use a select() with a small timeout (busy
loop), but the blocking read is more convenient, as there may be many
instances of this app running.  

>   How hard would it be to change these programs to use
>   sigaction() to enable the EINTR behavior?  We've got the source

Actually, you already have to specifically call sigaction to get the EINTR
behavior - glibc signal() (obsolete) sets SA_RESTART by default.  What this
suggests is that something in pthreads is unsetting SA_RESTART, or calling
sigaction without it.  grep?

> Moreover, I'm willing to bet money that a large percentage of user-land
> programmers aren't even _aware_ of the EINTR issue.  The Netscape people
> certainly weren't for Netscape 4.x.  Don't know if the Mozilla people are.

Then they have never read a decent UNIX programming text.  Page 275 of
"Advanced Programming in the UNIX Environment", W. Richard Stevens (a de
facto standard for texts in this area).  An entire section of the chapter
is spent on interrupted syscalls.  If I were the manager of a team of
people who didn't handle it, I'd be very disappointed.




Re: National Semiconductor DP83815 ethernet driver?

2000-12-12 Thread Tim Hockin

> From searching Google, I know some sort of driver exists. In July, Adam J.
> Richter ([EMAIL PROTECTED]) posted a 2.2.16 driver he obtained from Dave
> Gotwisner at Wyse Technologies. And Tim Hockin mentioned that he was using
> an NSC driver, but had made some minor modifications.

We're still using a heavily hacked version of the NSC driver on 2.2.x.
When we do 2.4.x, I'll examine the other driver more closely.  I can send
you my hacks on the NSC version, if you need.



Re: how mmap() works?

2001-04-01 Thread Tim Hockin

> Without syncing, Linux writes whenever it thinks it's appropriate, e.g.
> when pages have to be freed (I think also when the bdflush writes back
> data, i.e. every 30 seconds by default).

what about mmap() on non-filesystem files (/dev/mem, /proc/bus/pci...) ?




Re: PCI IRQ routing problem in 2.4.0

2001-01-28 Thread Tim Hockin

> > Device 00:01.0 (slot 0): ISA bridge
> >   INTA: link 0x01, irq mask 0x1eb8 [3,4,5,7,9,10,11,12]
> >   INTB: link 0x02, irq mask 0x1eb8 [3,4,5,7,9,10,11,12]
> >   INTC: link 0x03, irq mask 0x1eb8 [3,4,5,7,9,10,11,12]
> >   INTD: link 0x04, irq mask 0x1eb8 [3,4,5,7,9,10,11,12]
> 
> Your "link" values are in the range 1-4. Which makes perfect sense, but
> that's absolutely _not_ what the Linux SiS routing code expects (the code 
> seems to expect them to be ASCII 'A' - 'D').


In reading the PIRQ specs, and making it work for our board, I thought
about this.  PIRQ states that link is chipset-dependent.  No chipset that I
have seen specifies what link should be.  So, as this case demonstrates, it
may be 'A' - the value the chipset expects, or 1, the logical index.
Either one makes sense, assuming the PIRQ routing code knows what link
means.  Here we see two BIOS vendors/versions that apparently do it
differently for the same chipset.  Grrr.





Re: [RFC/Patch 2.6.11] Take control of PCI Master Abort Mode

2005-04-14 Thread Tim Hockin
On 4/13/05, Dave Jones <[EMAIL PROTECTED]> wrote:

> If we have a situation where we screw a subset of users with the
> config option =y and a different subset with =n, how is this improving
> the situation any over what we have today ?

Dave,

What's a good alternative?  Do we need to keep a whitelist of hardware
that is known to work?  A blacklist is pretty risky, since this is a very
hard problem to find.

What if it was always on, except when the command line option was passed
(eliminating the CONFIG option)?  Really 'leet hackers could tweak a #define
if they don't like the command line option...



[PATCH] 66MHz PCI flag from commandline

2001-05-31 Thread Tim Hockin

Martin,

We spoke a while back about a pcispeed= command line param to set the PCI
busspeed values (for later querying, if needed).  Attached is my patch to
implement the feature we agreed upon.  It is against linux-2.4.5.

Below is our previous discussion, as a refresher :).  Please let me know if
this is not suitable for general inclusion in the kernel, and I'll try to
make it so.

Tim

(cc: lkml, alan)


Martin Mares wrote:
> > What do you think of my   pcispeed=0:33,2:66 idea?
 
> anything -- the 33/66 MHz values from the PCI specs are only upper limits),
> I'll welcome this option, but otherwise I'd rather like to use the measuring
> code in IDE driver as it requires no user intervention to get the right
> timing.

-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]

diff -ruN dist-2.4.5/drivers/pci/pci.c cobalt-2.4.5/drivers/pci/pci.c
--- dist-2.4.5/drivers/pci/pci.cSat May 19 17:43:06 2001
+++ cobalt-2.4.5/drivers/pci/pci.c  Thu May 31 14:32:33 2001
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include <linux/ctype.h>
 #include <linux/kmod.h>/* for hotplug_path */
 #include 
 
@@ -37,6 +38,8 @@
 LIST_HEAD(pci_root_buses);
 LIST_HEAD(pci_devices);
 
+static int get_bus_speed(struct pci_bus *bus);
+
 /**
  * pci_find_slot - locate PCI device from a given PCI slot
  * @bus: number of PCI bus on which desired PCI device resides
@@ -928,6 +931,7 @@
child->number = child->secondary = busnr;
child->primary = parent->secondary;
child->subordinate = 0xff;
+   child->bus_speed = get_bus_speed(child);
 
/* Set up default resource pointers.. */
for (i = 0; i < 4; i++)
@@ -1110,8 +1114,19 @@
return NULL;
 
/* some broken boards return 0 or ~0 if a slot is empty: */
-   if (l == 0xffffffff || l == 0x00000000 || l == 0x0000ffff || l == 0xffff0000)
+   if (l == 0xffffffff || l == 0x00000000
+   || l == 0x0000ffff || l == 0xffff0000) {
+   /*
+* host/pci and pci/pci bridges will set Received Master Abort
+* (bit 13) on failed configuration access (happens when
+* searching for devices).  To be safe, clear the status
+* register.
+*/
+   unsigned short st;
+   pci_read_config_word(temp, PCI_STATUS, &st);
+   pci_write_config_word(temp, PCI_STATUS, st);
return NULL;
+   }
 
dev = kmalloc(sizeof(*dev), GFP_KERNEL);
if (!dev)
@@ -1239,6 +1254,7 @@
list_add_tail(&b->node, &pci_root_buses);
 
b->number = b->secondary = bus;
+   b->bus_speed = get_bus_speed(b);
b->resource[0] = &ioport_resource;
b->resource[1] = &iomem_resource;
return b;
@@ -1739,7 +1755,66 @@
return 1;
 }
 
+#define MAX_OVERRIDES 256
+static int pci_speed_overrides[MAX_OVERRIDES] __initdata;
+
+static int __init get_bus_speed(struct pci_bus *bus)
+{
+   if (!bus) {
+   return -1;
+   }
+
+   if (pci_speed_overrides[bus->number]) {
+   return pci_speed_overrides[bus->number];
+   } else {
+   /* printk("PCI: assuming 33 MHz for bus %d\n", bus->number); */
+   return 33;
+   }
+}
+
+/* handle pcispeed=0:33,1:66 parameter (speed=0 means unknown) */
+static int __init pci_speed_setup(char *str)
+{
+while (str) {
+char *k = strchr(str, ',');
+if (k) {
+*k++ = '\0';
+   }
+
+if (*str) {
+int bus;
+int speed;
+char *endp;
+
+   if (!isdigit(*str)) {
+   printk("PCI: bad bus number for "
+   "pcispeed parameter\n");
+   str = k;
+   continue;
+   }
+bus = simple_strtoul(str, &endp, 0);
+
+if (!*endp || !isdigit(*(++endp))) {
+   printk("PCI: bad speed for "
+   "pcispeed parameter\n");
+   str = k;
+   continue;
+   }
+   speed = simple_strtoul(endp, NULL, 0);
+   pci_speed_overrides[bus] = speed;
+   printk("PCI: setting bus %d speed to %d MHz\n",
+   bus, speed);
+
+   str = k;
+   } else {
+   break;
+   }
+   }
+   return 1;
+}
+
 __setup("pci=", pci_setup);
__setup("pcispeed=", pci_speed_setup);

[PATCH] new PCI ids

2001-05-31 Thread Tim Hockin

Attached is a patch for cleaning up some PCI ids and adding a few that were
missing.  Please let me know of any problems with this.

(diff against 2.4.5)

Tim

-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]

diff -ruN dist-2.4.5/drivers/pci/pci.ids cobalt-2.4.5/drivers/pci/pci.ids
--- dist-2.4.5/drivers/pci/pci.ids  Sat May 19 17:49:14 2001
+++ cobalt-2.4.5/drivers/pci/pci.idsThu May 31 14:32:33 2001
@@ -4,7 +4,7 @@
 #  Maintained by Martin Mares <[EMAIL PROTECTED]>
 #  If you have any new entries, send them to the maintainer.
 #
-#  $Id: pci.ids,v 1.62 2000/06/28 10:56:36 mj Exp $
+#  $Id: pci.ids,v 1.3 2001/04/04 20:40:25 thockin Exp $
 #
 
 # Vendors, devices and subsystems. Please keep sorted.
@@ -244,6 +244,7 @@
000f  OHCI Compliant FireWire Controller
0011  National PCI System I/O
0012  USB Controller
+   0020  DP83815 (MacPhyter) Ethernet Controller
d001  87410 IDE
 100c  Tseng Labs Inc
3202  ET4000/W32p rev A
@@ -1925,9 +1926,9 @@
1102 8051  CT4850 SBLive! Value
7002  SB Live!
1102 0020  Gameport Joystick
-1103  Triones Technologies, Inc.
-   0003  HPT343
-   0004  HPT366
+1103  HighPoint Technologies, Inc.
+   0003  HPT343 UltraDMA 33 IDE Controller
+   0004  HPT366/370 UltraDMA 66/100 IDE Controller
 1104  RasterOps Corp.
 1105  Sigma Designs, Inc.
8300  REALmagic Hollywood Plus DVD Decoder
@@ -2335,13 +2336,16 @@
 1165  Imagraph Corporation
0001  Motion TPEG Recorder/Player with audio
 1166  ServerWorks
-   0007  CNB20-LE CPU to PCI Bridge
-   0008  CNB20HE
-   0009  CNB20LE
+   0007  CNB20-LE Host Bridge
+   0008  CNB20HE Host Bridge
+   0009  CNB20LE Host Bridge
0010  CIOB30
0011  CMIC-HE
-   0200  OSB4
-   0201  CSB5
+   0200  OSB4 South Bridge
+   0201  CSB5 South Bridge
+   0211  OSB4 IDE Controller
+   0212  CSB5 IDE Controller
+   0220  OSB4/CSB5 OHCI USB Controller
 1167  Mutoh Industries Inc
 1168  Thine Electronics Inc
 1169  Centre for Development of Advanced Computing
diff -ruN dist-2.4.5/include/linux/pci_ids.h cobalt-2.4.5/include/linux/pci_ids.h
--- dist-2.4.5/include/linux/pci_ids.h  Wed May 16 10:25:39 2001
+++ cobalt-2.4.5/include/linux/pci_ids.h	Thu May 31 14:33:17 2001
@@ -991,10 +991,12 @@
 #define PCI_DEVICE_ID_SERVERWORKS_LE   0x0009
 #define PCI_DEVICE_ID_SERVERWORKS_CIOB30   0x0010
 #define PCI_DEVICE_ID_SERVERWORKS_CMIC_HE  0x0011
-#define PCI_DEVICE_ID_SERVERWORKS_CSB5	0x0201
 #define PCI_DEVICE_ID_SERVERWORKS_OSB4 0x0200
+#define PCI_DEVICE_ID_SERVERWORKS_CSB5   0x0201
 #define PCI_DEVICE_ID_SERVERWORKS_OSB4IDE 0x0211
+#define PCI_DEVICE_ID_SERVERWORKS_CSB5IDE 0x0212
 #define PCI_DEVICE_ID_SERVERWORKS_OSB4USB 0x0220
+#define PCI_DEVICE_ID_SERVERWORKS_CSB5USB PCI_DEVICE_ID_SERVERWORKS_OSB4USB
 
 #define PCI_VENDOR_ID_SBE  0x1176
 #define PCI_DEVICE_ID_SBE_WANXL100 0x0301



[PATCH] IDE GET/SET_BUSSTATE ioctls

2001-05-31 Thread Tim Hockin

Andre,

We spoke a while back about a GET/SET BUSSTATE API for IDE.  Attached is my
(very simple) patch adding 2 ioctls, and obsoleting 1.  I will send the
implementation of this for the HPT370 in a different message.  Please let
me know if there are any problems with this preventing general inclusion.

This patch also includes support for a configurable 'max failures'
parameter, and one change for better DMA error reporting.
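For context, here is a hedged userspace sketch of how the new ioctl might be called. The constants are copied from the hdreg.h hunk in this patch; the device path, the `long` argument type, and the helper name `get_busstate` are illustrative assumptions, not part of the patch.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Values as defined in the hdreg.h part of this patch. */
#define HDIO_GET_BUSSTATE  0x030e  /* get the bus state of the hwif */
#define HDIO_SET_BUSSTATE  0x032b  /* set the bus state of the hwif */

/* bus states, from the patch */
enum { BUSSTATE_OFF = 0, BUSSTATE_ON, BUSSTATE_TRISTATE };

/* Query the bus state of the interface behind `dev` (e.g. "/dev/hda").
 * Returns a BUSSTATE_* value, or -1 on error.  Illustrative only. */
static long get_busstate(const char *dev)
{
    long state = -1;
    int fd = open(dev, O_RDONLY | O_NONBLOCK);

    if (fd < 0)
        return -1;
    if (ioctl(fd, HDIO_GET_BUSSTATE, &state) != 0)
        state = -1;
    close(fd);
    return state;
}
```

A hot-swap tool would pair this with `HDIO_SET_BUSSTATE` and `BUSSTATE_TRISTATE` before pulling a drive sled.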

Tim


Andre Hedrick wrote:
> 
> Bring it on! ;-)
> 
> On Tue, 27 Mar 2001, Tim Hockin wrote:
> 
> > Andre,
> >
> > I'm doing some work toward hotswap IDE, and I had a query for you.  On
> > 2.2.x we added a HDIO_GET_BUSSTATE and HDIO_SET_BUSSTATE ioctl() pair.  Now
> > I see in 2.4 that there is an HDIO_TRISTATE_HWIF ioctl(), but no way to
> > un-tristate or query the status.
> >
> > Are there plans to add the converse APIs?  I see no one has yet implemented
> > the HWIF_TRISTATE_BUS ioctl() - would you accept my patch to
> > implement the HDIO_{GET,SET}_BUSSTATE, and implementation of it on the
> > HPT366 driver?


-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]

diff -ruN dist-2.4.5/include/linux/hdreg.h cobalt-2.4.5/include/linux/hdreg.h
--- dist-2.4.5/include/linux/hdreg.h	Fri May 25 18:01:27 2001
+++ cobalt-2.4.5/include/linux/hdreg.h  Thu May 31 14:33:16 2001
@@ -181,9 +181,10 @@
 #define HDIO_GET_DMA   0x030b  /* get use-dma flag */
 #define HDIO_GET_NICE  0x030c  /* get nice flags */
 #define HDIO_GET_IDENTITY  0x030d  /* get IDE identification info */
+#define HDIO_GET_BUSSTATE  0x030e  /* get the bus state of the hwif */
 
 #define HDIO_DRIVE_RESET   0x031c  /* execute a device reset */
-#define HDIO_TRISTATE_HWIF 0x031d  /* execute a channel tristate */
+#define HDIO_TRISTATE_HWIF 0x031d  /* OBSOLETE - use SET_BUSSTATE */
 #define HDIO_DRIVE_TASK	0x031e  /* execute task and special drive command */
 #define HDIO_DRIVE_CMD 0x031f  /* execute a special drive command */
 
@@ -200,6 +201,14 @@
 #define HDIO_SCAN_HWIF 0x0328  /* register and (re)scan interface */
 #define HDIO_SET_NICE  0x0329  /* set nice flags */
 #define HDIO_UNREGISTER_HWIF   0x032a  /* unregister interface */
+#define HDIO_SET_BUSSTATE  0x032b  /* set the bus state of the hwif */
+
+/* bus states */
+enum {
+   BUSSTATE_OFF = 0,
+   BUSSTATE_ON,
+   BUSSTATE_TRISTATE
+};
 
 /* BIG GEOMETRY */
 struct hd_big_geometry {
diff -ruN dist-2.4.5/include/linux/ide.h cobalt-2.4.5/include/linux/ide.h
--- dist-2.4.5/include/linux/ide.h  Fri May 25 18:02:42 2001
+++ cobalt-2.4.5/include/linux/ide.h	Thu May 31 14:33:16 2001
@@ -349,6 +350,8 @@
byteinit_speed; /* transfer rate set at boot */
bytecurrent_speed;  /* current transfer rate set */
bytedn; /* now wide spread use */
+   unsigned intfailures;   /* current failure count */
+   unsigned intmax_failures;   /* maximum allowed failure count */
 } ide_drive_t;
 
 /*
@@ -397,6 +400,11 @@
 typedef void (ide_rw_proc_t) (ide_drive_t *, ide_dma_action_t);
 
 /*
+ * ide soft-power support
+ */
+typedef int (ide_busproc_t) (struct hwif_s *, int);
+
+/*
  * hwif_chipset_t is used to keep track of the specific hardware
  * chipset used by each IDE interface, if known.
  */
@@ -467,6 +475,8 @@
 #endif
bytestraight8;  /* Alan's straight 8 check */
void*hwif_data; /* extra hwif data */
+   ide_busproc_t   *busproc;   /* driver soft-power interface */
+   bytebus_state;  /* power state of the IDE bus */
 } ide_hwif_t;
 
 
diff -ruN dist-2.4.5/drivers/ide/ide.c cobalt-2.4.5/drivers/ide/ide.c
--- dist-2.4.5/drivers/ide/ide.cTue May  1 16:05:00 2001
+++ cobalt-2.4.5/drivers/ide/ide.c  Thu May 31 14:32:16 2001
@@ -161,6 +161,9 @@
 #include 
 #endif /* CONFIG_KMOD */
 
+/* default maximum number of failures */
+#define IDE_DEFAULT_MAX_FAILURES   1
+
 static const byte ide_hwif_to_major[] = { IDE0_MAJOR, IDE1_MAJOR, IDE2_MAJOR, IDE3_MAJOR, IDE4_MAJOR, IDE5_MAJOR, IDE6_MAJOR, IDE7_MAJOR, IDE8_MAJOR, IDE9_MAJOR };
 
 static int idebus_parameter; /* holds the "idebus=" parameter */
@@ -248,6 +251,7 @@
hwif->name[1]   = 'd';
hwif->name[2]   = 'e';
hwif->name[3]   = '0' + index;
+   hwif->bus_state = BUSSTATE_ON;
for (unit = 0; unit < MAX_DRIVES; ++unit) {
ide_drive_t *drive = &hwif->drives[unit];
 
@@ -262,6 +266,7 @@
drive->name[0]  = 'h';
drive->name[1]  = 'd';
		drive->name[2]  = 'a' + (index * MAX_DRIVES) + unit;

[PATCH] sym53c8xx timer and smp fixes

2001-05-31 Thread Tim Hockin

All,

Attached is a patch for sym53c8xx.c to handle the error timer better, and
be more proper for SMP.  The changes are very simple, and have been beaten
on by us.  Please let me know if there are any problems accepting this
patch for general inclusion.

Tim 
-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]

diff -ruN dist-2.4.5/drivers/scsi/sym53c8xx.c cobalt-2.4.5/drivers/scsi/sym53c8xx.c
--- dist-2.4.5/drivers/scsi/sym53c8xx.c Fri Apr 27 13:59:19 2001
+++ cobalt-2.4.5/drivers/scsi/sym53c8xx.c   Thu May 31 14:32:43 2001
@@ -634,8 +636,11 @@
 #if LINUX_VERSION_CODE >= LinuxVersionCode(2,1,93)
 
 spinlock_t sym53c8xx_lock = SPIN_LOCK_UNLOCKED;
+spinlock_t sym53c8xx_host_lock = SPIN_LOCK_UNLOCKED;
 #define	NCR_LOCK_DRIVER(flags)     spin_lock_irqsave(&sym53c8xx_lock, flags)
 #define	NCR_UNLOCK_DRIVER(flags)   spin_unlock_irqrestore(&sym53c8xx_lock,flags)
+#define	NCR_LOCK_HOSTS(flags)      spin_lock_irqsave(&sym53c8xx_host_lock, flags)
+#define	NCR_UNLOCK_HOSTS(flags)    spin_unlock_irqrestore(&sym53c8xx_host_lock,flags)
 
 #define NCR_INIT_LOCK_NCB(np)  spin_lock_init(&np->smp_lock);
 #defineNCR_LOCK_NCB(np, flags)spin_lock_irqsave(&np->smp_lock, flags)
@@ -650,6 +655,8 @@
 
 #define	NCR_LOCK_DRIVER(flags)     do { save_flags(flags); cli(); } while (0)
 #define	NCR_UNLOCK_DRIVER(flags)   do { restore_flags(flags); } while (0)
+#define	NCR_LOCK_HOSTS(flags)      do { save_flags(flags); cli(); } while (0)
+#define	NCR_UNLOCK_HOSTS(flags)    do { restore_flags(flags); } while (0)
 
 #define	NCR_INIT_LOCK_NCB(np)      do { } while (0)
 #define	NCR_LOCK_NCB(np, flags)    do { save_flags(flags); cli(); } while (0)
@@ -695,7 +702,7 @@
return page_remapped? (page_remapped + page_offs) : 0UL;
 }
 
-static void __init unmap_pci_mem(u_long vaddr, u_long size)
+static void unmap_pci_mem(u_long vaddr, u_long size)
 {
if (vaddr)
iounmap((void *) (vaddr & PAGE_MASK));
@@ -2249,7 +2265,6 @@
**
*/
struct usrcmd   user;   /* Command from user*/
-   volatile u_char release_stage;  /* Synchronisation stage on release  */
 
/*
**  Fields that are used (primarily) for integrity check
@@ -5868,7 +5883,12 @@
**  start the timeout daemon
*/
np->lasttime=0;
-   ncr_timeout (np);
+#ifdef SCSI_NCR_PCIQ_BROKEN_INTR
+   np->timer.expires = ktime_get((HZ+9)/10);
+#else
+   np->timer.expires = ktime_get(SCSI_NCR_TIMER_INTERVAL);
+#endif
+   add_timer(&np->timer);
 
/*
**  use SIMPLE TAG messages by default
@@ -7227,23 +7247,19 @@
 **==
 */
 
-#ifdef MODULE
 static int ncr_detach(ncb_p np)
 {
-   int i;
+   unsigned long flags;
 
printk("%s: detaching ...\n", ncr_name(np));
 
 /*
-** Stop the ncr_timeout process
-** Set release_stage to 1 and wait that ncr_timeout() set it to 2.
+** Stop the ncr_timeout process - lock it to ensure no timer is running
+** on a different CPU, or anything
 */
-   np->release_stage = 1;
-   for (i = 50 ; i && np->release_stage != 2 ; i--) MDELAY (100);
-   if (np->release_stage != 2)
-   printk("%s: the timer seems to be already stopped\n",
-   ncr_name(np));
-   else np->release_stage = 2;
+   NCR_LOCK_NCB(np, flags);
+   del_timer(&np->timer);
+   NCR_UNLOCK_NCB(np, flags);
 
 /*
 ** Reset NCR chip.
@@ -7273,7 +7289,6 @@
 
return 1;
 }
-#endif
 
 /*==
 **
@@ -8600,23 +8615,11 @@
 {
u_long  thistime = ktime_get(0);
 
-   /*
-   **  If release process in progress, let's go
-   **  Set the release stage from 1 to 2 to synchronize
-   **  with the release process.
-   */
-
-   if (np->release_stage) {
-   if (np->release_stage == 1) np->release_stage = 2;
-   return;
-   }
-
 #ifdef SCSI_NCR_PCIQ_BROKEN_INTR
-   np->timer.expires = ktime_get((HZ+9)/10);
+   mod_timer(&np->timer, ktime_get((HZ+9)/10));
 #else
-   np->timer.expires = ktime_get(SCSI_NCR_TIMER_INTERVAL);
+   mod_timer(&np->timer, ktime_get(SCSI_NCR_TIMER_INTERVAL));
 #endif
-   add_timer(&np->timer);
 
/*
**  If we are resetting the ncr, wait for settle_time before 
@@ -13071,7 +13075,7 @@
(int) (PciDeviceFn(pdev) & 7));
 
 #ifdef SCSI_NCR_DYNAMIC_DMA_MAPPING
-   if (pci_set_dma_mask(pdev, (dma_addr_t) (0xffffffffUL))) {
+   if (!

[PATCH] HPT370 misc

2001-05-31 Thread Tim Hockin

Andre,

Attached is a patch for hpt366.c for the following:
better support for multiple controllers
better /proc output
66 MHz PCI timings
implement the HDIO_GET/SET_BUSSTATE ioctls (see previous patch)

This patch does rely on the PCI busspeed patch (sent to lkml earlier).

Please let me know if you have any problems with this for general
inclusion.

Tim

-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] HPT370 misc (for real this time)

2001-05-31 Thread Tim Hockin

Andre,

Attached is a patch for hpt366.c for the following:
better support for multiple controllers
better /proc output
66 MHz PCI timings
implement the HDIO_GET/SET_BUSSTATE ioctls (see previous patch)

This patch does rely on the PCI busspeed patch (sent to lkml earlier).

Please let me know if you have any problems with this for general
inclusion.

Tim

-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]

diff -ruN dist-2.4.5/drivers/ide/hpt366.c cobalt-2.4.5/drivers/ide/hpt366.c
--- dist-2.4.5/drivers/ide/hpt366.c Sat May 19 17:43:06 2001
+++ cobalt-2.4.5/drivers/ide/hpt366.c   Thu May 31 14:32:15 2001
@@ -11,6 +11,17 @@
  *
  * Note that final HPT370 support was done by force extraction of GPL.
  *
+ * add function for getting/setting power status of drive
+ * Adrian Sun <[EMAIL PROTECTED]>
+ *
+ * add drive timings for 66MHz PCI bus,
+ * fix ATA Cable signal detection, fix incorrect /proc info
+ * add /proc display for per-drive PIO/DMA/UDMA mode and
+ * per-channel ATA-33/66 Cable detect.
+ * Duncan Laurie <[EMAIL PROTECTED]>
+ *
+ * fixup /proc output for multiple controllers
+ *     Tim Hockin <[EMAIL PROTECTED]>
  */
 
 #include 
@@ -28,6 +39,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 
@@ -170,62 +182,126 @@
{   0,  0x06514e57, 0x06514e57  }
 };
 
+struct chipset_bus_clock_list_entry sixty_six_base_hpt370[] = {
+   {   XFER_UDMA_5,0x14846231, 0x14846231  },
+   {   XFER_UDMA_4,0x14886231, 0x14886231  },
+   {   XFER_UDMA_3,0x148c6231, 0x148c6231  },
+   {   XFER_UDMA_2,0x148c6231, 0x148c6231  },
+   {   XFER_UDMA_1,0x14906231, 0x14906231  },
+   {   XFER_UDMA_0,0x14986231, 0x14986231  },
+   
+   {   XFER_MW_DMA_2,  0x26514e21, 0x26514e21  },
+   {   XFER_MW_DMA_1,  0x26514e33, 0x26514e33  },
+   {   XFER_MW_DMA_0,  0x26514e97, 0x26514e97  },
+   
+   {   XFER_PIO_4, 0x06514e21, 0x06514e21  },
+   {   XFER_PIO_3, 0x06514e22, 0x06514e22  },
+   {   XFER_PIO_2, 0x06514e33, 0x06514e33  },
+   {   XFER_PIO_1, 0x06914e43, 0x06914e43  },
+   {   XFER_PIO_0, 0x06914e57, 0x06914e57  },
+   {   0,  0x06514e57, 0x06514e57  }
+};
+
 #define HPT366_DEBUG_DRIVE_INFO	0
 #define HPT370_ALLOW_ATA100_5  1
 #define HPT366_ALLOW_ATA66_4   1
 #define HPT366_ALLOW_ATA66_3   1
+#define HPT366_MAX_DEVS	8
+
+static struct pci_dev *hpt_devs[HPT366_MAX_DEVS];
+static int n_hpt_devs;
+
+static unsigned int pci_rev_check_hpt3xx(struct pci_dev *dev);
+static unsigned int pci_rev2_check_hpt3xx(struct pci_dev *dev);
+byte hpt366_proc = 0;
+byte hpt363_shared_irq;
+byte hpt363_shared_pin;
+extern char *ide_xfer_verbose (byte xfer_rate);
 
 #if defined(DISPLAY_HPT366_TIMINGS) && defined(CONFIG_PROC_FS)
 static int hpt366_get_info(char *, char **, off_t, int);
 extern int (*hpt366_display_info)(char *, char **, off_t, int); /* ide-proc.c */
 extern char *ide_media_verbose(ide_drive_t *);
-static struct pci_dev *bmide_dev;
-static struct pci_dev *bmide2_dev;
 
 static int hpt366_get_info (char *buffer, char **addr, off_t offset, int count)
 {
-   char *p = buffer;
-   u32 bibma   = bmide_dev->resource[4].start;
-   u32 bibma2  = bmide2_dev->resource[4].start;
-   char *chipset_names[] = {"HPT366", "HPT366", "HPT368", "HPT370", "HPT370A"};
-   u8  c0 = 0, c1 = 0;
-   u32 class_rev;
-
-   pci_read_config_dword(bmide_dev, PCI_CLASS_REVISION, &class_rev);
-   class_rev &= 0xff;
-
-/*
- * at that point bibma+0x2 et bibma+0xa are byte registers
- * to investigate:
- */
-   c0 = inb_p((unsigned short)bibma + 0x02);
-   if (bmide2_dev)
-   c1 = inb_p((unsigned short)bibma2 + 0x02);
-
-   p += sprintf(p, "\n%s Chipset.\n", chipset_names[class_rev]);
-   p += sprintf(p, "--- Primary Channel  Secondary Channel -\n");
-   p += sprintf(p, "%sabled %sabled\n",
-   (c0&0x80) ? "dis" : " en",
-   (c1&0x80) ? "dis" : " en");
-   p += sprintf(p, "--- drive0 - drive1  drive0 -- drive1 --\n");
-   p += sprintf(p, "DMA enabled:%s  %s %s  %s\n",
-   (c0&0x20) ? "yes" : "no ",

[PATCH] support for Cobalt Networks (x86 only) systems

2001-05-31 Thread Tim Hockin

Alan,

Attached is a (large, but self-contained) patch for Cobalt Networks support
for x86 systems (RaQ3, RaQ4, Qube3, RaQXTR).  Please let me know if there
is anything that would prevent this from general inclusion in the next
release.

(patch against 2.4.5)

Thanks

Tim
-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]



[PATCH] almost forgot this one

2001-05-31 Thread Tim Hockin

Add a rwproc entry to the ide structure, for recalling what happened last
time!

Please let me know if there are any problems with this patch (some of the
patches I sent earlier depend on this).

Tim
-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]

diff -ruN dist-2.4.5/include/linux/ide.h ../cobalt-2.4.5/include/linux/ide.h
--- dist-2.4.5/include/linux/ide.h  Thu May 31 18:22:46 2001
+++ ../cobalt-2.4.5/include/linux/ide.h Thu May 31 14:33:16 2001
@@ -284,6 +284,7 @@
unsigned long service_time; /* service time of last request */
unsigned long timeout;  /* max time to wait for irq */
special_t   special;/* special action flags */
+   void *rwproc_cache; /* last rwproc update */
byte keep_settings; /* restore settings after drive reset */
byte using_dma; /* disk is using dma for read/write */
byte waiting_for_dma;   /* dma currently in progress */



[PATCH] support for Cobalt Networks (x86 only) systems (for real this time)

2001-05-31 Thread Tim Hockin

apparently, LKML silently (!) bounces messages > a certain size.  So I'll
try smaller patches.  This is part 2/2 of the general Cobalt support.

Alan,

Attached is a (large, but self-contained) patch for Cobalt Networks support
for x86 systems (RaQ3, RaQ4, Qube3, RaQXTR).  Please let me know if there
is anything that would prevent this from general inclusion in the next
release.

(patch against 2.4.5)

Thanks

Tim
-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]

diff -ruN dist-2.4.5/drivers/cobalt/README cobalt-2.4.5/drivers/cobalt/README
--- dist-2.4.5/drivers/cobalt/README	Wed Dec 31 16:00:00 1969
+++ cobalt-2.4.5/drivers/cobalt/README  Thu May 31 14:32:15 2001
@@ -0,0 +1,19 @@
+Notes on Cobalt's drivers:
+
+You will notice in several places constructs such as this:
+
+   if (cobt_is_3k()) {
+   foo();
+   } else if (cobt_is_5k()) {
+   bar();
+   }
+
+The goal here is to only compile in code that is needed, but to allow one to
+define support for both 3k and 5k (and more?) style systems.  The systype
+check macros are very simple and clean.  They check whether config-time
+support for the generation has been enabled, and (if so) whether systype
+detected the specified generation.  This leaves the code free from #ifdef
+cruft, but lets the compiler throw out unsupported generation-specific code
+with if (0) detection.
+
+--
diff -ruN dist-2.4.5/drivers/cobalt/ruler.c cobalt-2.4.5/drivers/cobalt/ruler.c
--- dist-2.4.5/drivers/cobalt/ruler.c   Wed Dec 31 16:00:00 1969
+++ cobalt-2.4.5/drivers/cobalt/ruler.c Thu May 31 14:32:15 2001
@@ -0,0 +1,393 @@
+/* 
+ * cobalt ruler driver 
+ * Copyright (c) 2000, Cobalt Networks, Inc.
+ * $Id: ruler.c,v 1.10 2001/05/30 07:19:48 thockin Exp $
+ *
+ * author: [EMAIL PROTECTED], [EMAIL PROTECTED]
+ *
+ * This should be SMP safe.  There are two critical pieces of data, and thus
+ * two locks.  The ruler_lock protects the arrays of channels (hwifs) and
+ * busproc function pointers.  These are only ever written in the
+ * register/unregister functions but read in several other places.  A
+ * read/write lock is appropriate.  The second lock is the lock on the sled
+ * led state and the I2C_DEV_RULER.  It gets called from timer context, so
+ * irqsave it. The global switches and sled_leds are atomic_t. --TPH
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define RULER_TIMEOUT  (HZ >> 1)  /* .5s */
+#define MAX_COBT_DRIVES4
+#define LED_SLED0  (1 << 3)
+#define LED_SLED1  (1 << 2)
+#define LED_SLED2  (1 << 1)
+#define LED_SLED3  (1 << 0)
+
+/* all of this is for gen V */
+static struct timer_list cobalt_ruler_timer;
+static rwlock_t ruler_lock = RW_LOCK_UNLOCKED;
+static spinlock_t rled_lock = SPIN_LOCK_UNLOCKED;
+static ide_hwif_t *channels[MAX_COBT_DRIVES];
+static ide_busproc_t *busprocs[MAX_COBT_DRIVES];
+/* NOTE: switches is a bitmask of DETACHED sleds */
+static atomic_t switches = ATOMIC_INIT(0); 
+static atomic_t sled_leds = ATOMIC_INIT(0);
+static int sled_led_map[] = {LED_SLED0, LED_SLED1, LED_SLED2, LED_SLED3};
+static int ruler_detect;
+
+static inline u8
+read_switches(void)
+{
+   u8 state = 0;
+   if (cobt_is_5k()) {
+   int tries = 3;
+
+   /* i2c can be busy, and this can read wrong - try a few times */
+   while (tries--) {
+   state = cobalt_i2c_read_byte(COBALT_I2C_DEV_DRV_SWITCH, 
+   0);
+   if ((state & 0xf0) != 0xf0) {
+   break;
+   }
+   }
+   }
+
+   return state;
+}
+
+/*
+ * deal with sled leds: LED on means OK to remove
+ * NOTE: all the reset lines are kept high. 
+ * NOTE: the reset lines are in the reverse order of the switches. 
+ */
+static void
+set_sled_leds(u8 leds)
+{
+   if (cobt_is_5k()) {
+   unsigned long flags;
+
+   spin_lock_irqsave(&rled_lock, flags);
+
+   atomic_set(&sled_leds, leds);
+   leds |= 0xf0;
+   cobalt_i2c_write_byte(COBALT_I2C_DEV_RULER, 0, leds);
+
+   spin_unlock_irqrestore(&rled_lock, flags);
+   }
+}
+
+static inline u8
+get_sled_leds(void)
+{
+   return atomic_read(&sled_leds);
+}
+
+/* this must be called with the ruler_lock held for read */
+static int
+do_busproc(int idx, ide_hwif_t *hwif, int arg)
+{
+   if (cobt_is_5k()) {
+   /* set sled LEDs */
+   switch (arg) {
+   case BUSSTATE_ON:
+   set_sled_leds(get_sled_leds
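The reversed switch-to-LED wiring that ruler.c's comments describe ("the reset lines are in the reverse order of the switches") can be illustrated with a small pure function. The bit constants come from the patch; `sleds_to_leds` itself is a hypothetical helper written only to show the mapping.

```c
#include <assert.h>

/* Sled LED bit assignments from ruler.c: LED bits run in the reverse
 * order of the drive-switch bits. */
#define LED_SLED0  (1 << 3)
#define LED_SLED1  (1 << 2)
#define LED_SLED2  (1 << 1)
#define LED_SLED3  (1 << 0)

static const int sled_led_map[] = { LED_SLED0, LED_SLED1, LED_SLED2, LED_SLED3 };

/* Translate a bitmask of detached sleds (bit i = sled i) into the LED
 * register value that lights "OK to remove" for exactly those sleds. */
static int sleds_to_leds(int switches)
{
    int leds = 0;
    int i;

    for (i = 0; i < 4; i++)
        if (switches & (1 << i))
            leds |= sled_led_map[i];
    return leds;
}
```

So detaching sled 0 lights the bit at the opposite end of the nibble, which is why the driver keeps a map rather than shifting directly.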

[PATCH] support for Cobalt Networks (x86 only) systems (for real this time)

2001-05-31 Thread Tim Hockin

apparently, LKML silently (!) bounces messages > a certain size.  So I'll
try smaller patches.  This is part 1/2 of the general Cobalt support.

Alan,

Attached is a (large, but self-contained) patch for Cobalt Networks support
for x86 systems (RaQ3, RaQ4, Qube3, RaQXTR).  Please let me know if there
is anything that would prevent this from general inclusion in the next
release.

(patch against 2.4.5)

Thanks

Tim
-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]

diff -ruN dist-2.4.5/drivers/cobalt/acpi.c cobalt-2.4.5/drivers/cobalt/acpi.c
--- dist-2.4.5/drivers/cobalt/acpi.c	Wed Dec 31 16:00:00 1969
+++ cobalt-2.4.5/drivers/cobalt/acpi.c  Thu May 31 14:32:15 2001
@@ -0,0 +1,218 @@
+/* 
+ * cobalt acpi driver 
+ * Copyright (c) 2000, Cobalt Networks, Inc.
+ * Copyright (c) 2001, Sun Microsystems, Inc.
+ * $Id: acpi.c,v 1.10 2001/05/30 07:19:47 thockin Exp $
+ *
+ * author: [EMAIL PROTECTED], [EMAIL PROTECTED]
+ *
+ * this driver just sets stuff up for ACPI interrupts
+ *
+ * if acpi support really existed in the kernel, we would read
+ * data from the ACPI tables. however, it doesn't. as a result,
+ * we use some hardcoded values. 
+ *
+ * This should be SMP safe.  The only data that needs protection is the acpi
+ * handler list.  It gets scanned at timer-interrupts, must use
+ * irqsave/restore locks. Read/write locks would be useful if there were any
+ * other times that the list was read but never written. --TPH
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define POWER_BUTTON_SHUTDOWN 0
+
+#define ACPI_IRQ   10   /* XXX: hardcoded interrupt */
+#define ACPI_NAME  "sci"
+#define ACPI_MAGIC 0xc0b7ac21
+
+#define SUPERIO_EVENT  0xff
+#define OSB4_EVENT 0x40
+#define OSB4_INDEX_PORT0x0cd6
+#define OSB4_DATA_PORT 0x0cd7
+
+/* for registering ACPI handlers */
+struct acpi_handler {
+   void (*function)(int irq, void *dev_id, struct pt_regs *regs);
+   struct acpi_handler *next;
+   struct acpi_handler *prev;
+};
+struct acpi_handler *acpi_handler_list;
+
+static spinlock_t acpi_lock = SPIN_LOCK_UNLOCKED;
+/* these two are for gen V */
+static u16 osb4_port;
+static u16 superio_port;
+
+static u16 
+get_reg(u16 index, u16 data, u8 port)
+{
+   if (cobt_is_5k()) {
+   u16 reg;
+
+   outb(port, index);
+   reg = inb(data);
+   outb(port + 1, index);
+   reg |= inb(data) << 8;
+   return reg;
+   }
+
+   return 0;
+}
+
+static void 
+acpi_interrupt(int irq, void *dev_id, struct pt_regs *regs)
+{
+   unsigned long flags, events;
+   struct acpi_handler *p;
+
+   spin_lock_irqsave(&acpi_lock, flags);
+
+   if (cobt_is_5k()) {
+   /* save the superio events */
+   events = inb(superio_port) | (inb(superio_port + 1) << 8);
+   
+   /* clear SCI interrupt generation */
+   outb(OSB4_EVENT, osb4_port); 
+   outb(SUPERIO_EVENT, superio_port);
+   outb(SUPERIO_EVENT, superio_port + 1);
+   }
+
+   /* call the ACPI handlers */
+   p = acpi_handler_list;
+   while (p) {
+   p->function(irq, dev_id, regs);
+   p = p->next;
+   }
+
+   spin_unlock_irqrestore(&acpi_lock, flags);
+}
+
+int
+cobalt_acpi_register_handler(void (*function)(int, void *, struct pt_regs *))
+{
+   struct acpi_handler *newh;
+   unsigned long flags;
+
+   newh = kmalloc(sizeof(*newh), GFP_ATOMIC);
+   if (!newh) {
+   EPRINTK("can't allocate memory for handler %p\n", function);
+   return -1;
+   }
+
+   spin_lock_irqsave(&acpi_lock, flags);
+
+   /* head insert */
+   newh->function = function;
+   newh->next = acpi_handler_list;
+   newh->prev = NULL;
+   if (acpi_handler_list) {
+   acpi_handler_list->prev = newh;
+   }
+   acpi_handler_list = newh;   
+
+   spin_unlock_irqrestore(&acpi_lock, flags);
+
+   return 0;
+}
+
+int
+cobalt_acpi_unregister_handler(void (*function)(int, void *, struct pt_regs *))
+{
+   struct acpi_handler *p;
+   unsigned long flags;
+   int r = -1;
+
+   spin_lock_irqsave(&acpi_lock, flags);
+
+   p = acpi_handler_list;
+   while (p) {
+   if (p->function == function) {
+   if (p->prev) {
+   p->prev->next = p->next;
+   }
+   if (p->next) {
+   p->next->prev = p->prev;
+   }
+   r = 0;
+   break;
+   }
+

Re: [PATCH] support for Cobalt Networks (x86 only) systems (for real this time)

2001-05-31 Thread Tim Hockin

> Looks interesting. Seemingly literate use of spinlocks.

thanks - I gave it lots of thought.

> Off-hand I see old style initialization. Is it right for new driver?

the old-style init is because it is an old driver.  I want to do a full-on
rework, but haven't had the time.

> i2c framework is not used, I wonder why. Someone thought that
> it was too heavy perhaps? If so, I disagree.

i2c is only in our stuff because the i2c core is not in the standard kernel
yet.  As soon as it is, I will make cobalt_i2c* go away.

> if any alignment with lm-sensors is possible, for the sake of

yes - I have communicated with the lm-sensors crew.  It is very high on my
wishlist.

> lcd_read bounces reads with -EINVAL when another read is in
> progress. Gross.

as I said - I didn't write the LCD driver, I just had to port it up :)  I
want to re-do the whole paradigm of it (it has been ported forward since 
2.0.3x)

> 1.:
>   p = head;
>   while (p) {
>   p = p->next;
>   }
> 
> It is what for(;;) does.

I don't get it - are you saying you do or don't like the while (p)
approach?  I think it is clearer because it is more true to the heuristic -
"start at the beginning and walk down the list".
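For reference, the two equivalent traversal styles under discussion, as a minimal sketch (the `node` type and function names are illustrative, not from either patch):

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; };

/* "Start at the beginning and walk down the list", spelled two ways.
 * Both visit every element from the head down and return the count. */
static int count_while(struct node *head)
{
    int n = 0;
    struct node *p = head;

    while (p) {
        n++;
        p = p->next;
    }
    return n;
}

static int count_for(struct node *head)
{
    int n = 0;

    for (struct node *p = head; p; p = p->next)
        n++;
    return n;
}
```

The compiler emits the same loop for both; the disagreement is purely about which reads closer to the intent.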

> 2. Spaces and tabs are mixed in funny ways, makes to cute effects
> when quoting diffs.

I've tried to eliminate that when I see it - I'll give the diff a close
examination.

thanks for the feedback - it will be nice to not have to constantly port
all our changes to each kernel release.  There are still some patches (of
course) but I didn't submit them because they are VERY specific to cobalt -
for example in the ide probing calling cobalt_ruler_register().  Ifdefs
protect, but the overall appearance would be rejected, I suspect - no?

Tim



cobalt patches

2001-06-01 Thread Tim Hockin

Thanks for all the comments, so far.  I'm going to incorporate all the
changes mentioned, and resubmit the questioned patches.

Tim

-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]



Re: [PATCH] support for Cobalt Networks (x86 only) systems (for real this time)

2001-06-01 Thread Tim Hockin

Pete Zaitcev wrote:

> > i2c is only in our stuff because the i2c core is not in the standard kernel
> > yet.  As soon as it is, I will make cobalt_i2c* go away.
> 
> I am puzzled by this comment. Did you look into drivers/i2c/?
> It certainly is a part of a stock kernel. The main user is
> the V4L, in drivers/media/video, but I think LM sensors use it too.

sorry, I meant to say:  The core is in, but the drivers for the adapters in
question are not.  They are part of lm_sensors, and as such, make it very
hard for us to maintain.  I have encouraged the lm_sensors crew to get at
LEAST the adapters/algorithms submitted for general inclusion.  Once that
is in, I will make cobalt_i2c go away.

Tim
-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]



[PATCH] save source address on accept()

2001-05-31 Thread Tim Hockin

All,

attached is a (small) patch which saves the src address on tcp_accept(). 
Please let me know if there are any problems taking this for general
inclusion.

Tim

-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]

diff -ruN dist-2.4.5/net/ipv4/tcp.c cobalt-2.4.5/net/ipv4/tcp.c
--- dist-2.4.5/net/ipv4/tcp.c   Wed May 16 10:31:27 2001
+++ cobalt-2.4.5/net/ipv4/tcp.c Thu May 31 14:33:23 2001
@@ -2138,6 +2138,7 @@
tp->accept_queue_tail = NULL;
 
newsk = req->sk;
+   newsk->rcv_saddr = req->af.v4_req.loc_addr;
tcp_acceptq_removed(sk);
tcp_openreq_fastfree(req);
BUG_TRAP(newsk->state != TCP_SYN_RECV);



Re: pset patch??

2001-06-07 Thread Tim Hockin

Khalid Aziz wrote:
> 
> Try
> <http://resourcemanagement.unixsolutions.hp.com/WaRM/schedpolicy.html>.
> It may do what you want.

> > I see references to this site http://isunix.it.ilstu.edu/~thockin/pset/.


try http://www.hockin.org/~thockin/pset

unfortunately, not ported to 2.4.x yet - should be easy, and is a more
complete implementation of sysmp() than the others.

-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]



Re: Lost Ticks on x86_64

2005-08-07 Thread Tim Hockin
On Sun, Aug 07, 2005 at 01:36:01PM +0200, Andi Kleen wrote:
> Erick Turnquist <[EMAIL PROTECTED]> writes:
> 
> > Hi, I'm running an Athlon64 X2 4400+ (a dual core model) with an
> > nVidia GeForce 6800 Ultra on a Gigabyte GA-K8NXP-SLI motherboard and
> > getting nasty messages like these in my dmesg:
> > 
> > warning: many lost ticks.
> > Your time source seems to be instable or some driver is hogging interupts
> > rip default_idle+0x20/0x30
> 
> It's most likely bad SMM code in the BIOS that blocks the CPU too long
> and is triggered in idle. You can verify that by using idle=poll
> (not recommended for production, just for testing) and see if it goes away.
> 
> No way to fix this, but you can work around it with very new kernels
> by compiling with a lower HZ than 1000.

Some BIOSes do not lock SMM, and you *could* turn it off at the chipset
level.


Re: Lost Ticks on x86_64

2005-08-07 Thread Tim Hockin
On Sun, Aug 07, 2005 at 02:46:50PM -0400, Erick Turnquist wrote:
> > Some BIOSes do not lock SMM, and you *could* turn it off at the chipset
> > level.
> 
> I don't see anything about SMM in my BIOS configuration even with the
> advanced options enabled... Turning it off at the chipset level sounds
> like a hardware hack - is it?

No, it's usually just a PCI register you can change.  Depends on your
chipset, though.


Re: Lost Ticks on x86_64

2005-08-07 Thread Tim Hockin
On Sun, Aug 07, 2005 at 02:51:19PM -0400, Lee Revell wrote:
> > It's most likely bad SMM code in the BIOS that blocks the CPU too long
> > and is triggered in idle. You can verify that by using idle=poll
> > (not recommended for production, just for testing) and see if it goes away.
> > 
> 
> WTF, since when do *desktops* use SMM?  Are you telling me that we have
> to worry about these stupid ACPI/SMM hardware bugs on the desktop too?

SMM is how BIOSes do legacy support (which stops at OS-handover).  It's
also how some BIOSes do ECC reporting and logging.

We just do PCI tweaks to turn it off in the OS.


Re: Lost Ticks on x86_64

2005-08-08 Thread Tim Hockin
On Mon, Aug 08, 2005 at 02:01:25PM +0200, Andi Kleen wrote:
> > Some BIOSes do not lock SMM, and you *could* turn it off at the chipset
> > level.
> 
> Doing so would be wasteful though. Both AMD and Intel CPUs need SMM code
> for the deeper C* sleep states.

Really?  I'm not too familiar with the deeper C states - what role does
SMM play?


Re: [Watchdog] alim7101_wdt problem on 2.6.10

2005-01-30 Thread Tim Hockin
On Mon, Jan 31, 2005 at 08:22:11AM +0100, Emmanuel Fleury wrote:
> Jan 30 00:58:21 hermes vmunix: alim7101_wdt: ALi 1543 South-Bridge does
> not have the correct revision number (???1001?) - WDT
> not set
> 
> What did I do wrong ?

You used the wrong South Bridge revision.  Seriously, older revisions of
M7101 did not have a WDT.  You seem to have an older revision.  Sorry.



Re: [Watchdog] alim7101_wdt problem on 2.6.10

2005-01-31 Thread Tim Hockin
On Mon, Jan 31, 2005 at 01:55:59PM -0500, Mike Waychison wrote:
> FWIW, some of the old cobalt boxes had the old south bridge revision
> with a WDT.  It managed to do resets/wdt off gpio pin 5 though, and
> there is a patch in Alan's 2.6.10-ac tree that handles it.

The old Cobalts also had a PIC attached to act as a WDT.  :)


EEPRO100/S support

2001-06-15 Thread Tim Hockin

Hey all,
I just had an eepro/100 S delivered to me.  I haven't dug through the specs
yet, but has anyone looked at this?  It supposedly has a 3DES ASIC built
into the core.

Any way we can use it?

-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]



/dev/nvram driver

2001-06-22 Thread Tim Hockin

Who is maintaining the /dev/nvram driver?  I have a couple things I want to
suggest/ask.  

Currently it tracks O_EXCL on open() and sets a flag, whereby no other
open() calls can succeed.  Is this functionality really needed?  Perhaps it
should just be a reader/writer model: n readers or 1 writer.  In that
case, should open() block on a writer, or return -EBUSY?

nvram_release() calls lock_kernel() - any particular reason?

Various other bits (nvram_open_cnt, for example) are not SMP safe.  I'm
making them safe now.

What I really want to know is: should I bother making nvram_open_cnt SMP
safe, or should it just go away altogether?  I vote for the latter
option, unless something depends on this behavior (in which case, other
fixes are needed, because it is broken :).

comments?


-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]



Re: MTD compiling error

2001-07-19 Thread Tim Hockin

> /usr/src/linux-2.4.6/include/linux/mtd/cfi.h: In function `cfi_spin_unlock':
> /usr/src/linux-2.4.6/include/linux/mtd/cfi.h:387: `do_softirq' undeclared 
> (first use in this function)
> /usr/src/linux-2.4.6/include/linux/mtd/cfi.h:387: (Each undeclared identifier 
> is reported only once
> /usr/src/linux-2.4.6/include/linux/mtd/cfi.h:387: for each function it 
> appears in.)
> make[3]: *** [cfi_probe.o] Error 1
> make[3]: Leaving directory `/usr/src/linux-2.4.6/drivers/mtd/chips'
> make[2]: *** [_modsubdir_chips] Error 2
> make[2]: Leaving directory `/usr/src/linux-2.4.6/drivers/mtd'
> make[1]: *** [_modsubdir_mtd] Error 2
> make[1]: Leaving directory `/usr/src/linux-2.4.6/drivers'
> make: *** [_mod_drivers] Error 2
> [root@kiwiunix linux]#
> 
> After adding #include  in the CFI.h header file, the result was 
> that there is a undeclared identifier. Since I don't know C (Only java, BBC 
> Basic, and COBOL), I don't know how to correct the problem.


include  I believe.  It is fixed in MTD's CVS




Re: HPT366 IDE DMA error question.

2001-04-27 Thread Tim Hockin

Mike Panetta wrote:

> hdi: timeout waiting for DMA
> ide_dmaproc: chipset supported ide_dma_timeout func only: 14
> hdi: irq timeout: status=0x58 { DriveReady SeekComplete DataRequest }
> hdi: DMA disabled
> ide4: reset: success
> 
> I get this message on all my off board HPT366 based controller
> cards.  I am using these cards with seagate Barracuda ATA III
> Model ST320414A 20GB drives.  Are there any known issues with
> these drives and the HPT366 based controllers?  Are there any

We have a system with HPT370s (which use the 366 driver) in which we found
the following obvious bug.  If you read the spec carefully, the fix is
obviously correct: you have to set DMA up separately for read vs. write.
Does this make your problems go away?  The diff is against 2.4.3.


-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]

diff -u dist-2.4.3/drivers/ide/hpt366.c linux-2.4/drivers/ide/hpt366.c
--- dist-2.4.3/drivers/ide/hpt366.c Sat Jan 27 08:45:58 2001
+++ linux-2.4/drivers/ide/hpt366.c  Thu Apr 26 20:15:17 2001
@@ -523,9 +638,11 @@
 
 void hpt370_rw_proc (ide_drive_t *drive, ide_dma_action_t func)
 {
-   if ((func != ide_dma_write) || (func != ide_dma_read))
+   if ((func != ide_dma_write && func != ide_dma_read)
+|| drive->rwproc_cache == (void *)func)
return;
hpt370_tune_chipset(drive, drive->current_speed, (func == ide_dma_write));
+   drive->rwproc_cache = (void *)func;
 }
 
 static int config_drive_xfer_rate (ide_drive_t *drive)
diff -u dist-2.4.3/include/linux/ide.h linux-2.4/include/linux/ide.h
--- dist-2.4.3/include/linux/ide.h  Mon Jan 29 23:25:32 2001
+++ linux-2.4/include/linux/ide.h   Thu Apr 26 20:16:00 2001
@@ -284,6 +284,7 @@
unsigned long service_time; /* service time of last request */
unsigned long timeout;  /* max time to wait for irq */
special_t   special;/* special action flags */
+   void *rwproc_cache; /* last rwproc update */
byte keep_settings; /* restore settings after drive reset */
byte using_dma; /* disk is using dma for read/write */
byte waiting_for_dma;   /* dma currently in progress */
 
 



PATCH: fix mxser driver for MOXA C104/PCI

2001-05-01 Thread Tim Hockin

The attached patch fixes the MOXA driver properly.  Array indexing is
0-based, so the enum now starts at 0 and the "- 1" adjustment is dropped
from each index.  It also uses a for loop for the PCI devices and bumps
the version number.


-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]

--- drivers/char/mxser.c.orig   Tue May  1 21:00:50 2001
+++ drivers/char/mxser.cTue May  1 21:01:06 2001
@@ -27,10 +27,11 @@
  *
  *  Copyright (C) 1999,2000  Moxa Technologies Co., LTD.
  *
- *  for : LINUX 2.0.X, 2.2.X
- *  date: 1999/07/22
- *  version : 1.1 
+ *  for : LINUX 2.0.X, 2.2.X, 2.4.X
+ *  date: 2001/05/01
+ *  version : 1.2 
  *  
+ *Fixes for C104H/PCI by Tim Hockin <[EMAIL PROTECTED]>
  */
 
 #include 
@@ -61,7 +62,7 @@
 #include 
 #include 
 
-#defineMXSER_VERSION   "1.1kern"
+#defineMXSER_VERSION   "1.2"
 
 #defineMXSERMAJOR  174
 #defineMXSERCUMAJOR175
@@ -120,7 +121,7 @@
 #define CI104J_ASIC_ID  5
 
 enum {
-   MXSER_BOARD_C168_ISA = 1,
+   MXSER_BOARD_C168_ISA = 0,
MXSER_BOARD_C104_ISA,
MXSER_BOARD_CI104J,
MXSER_BOARD_C168_PCI,
@@ -434,7 +425,7 @@
 "mxser", info);
if (retval) {
restore_flags(flags);
-   printk("Board %d: %s", board, mxser_brdname[hwconf->board_type - 1]);
+   printk("Board %d: %s", board, mxser_brdname[hwconf->board_type]);
printk("  Request irq fail,IRQ (%d) may be conflit with another 
device.\n", info->irq);
return (retval);
}
@@ -455,7 +446,7 @@
unsigned int ioaddress;
 
hwconf->board_type = board_type;
-   hwconf->ports = mxser_numports[board_type - 1];
+   hwconf->ports = mxser_numports[board_type];
ioaddress = pci_resource_start (pdev, 2);
for (i = 0; i < hwconf->ports; i++)
hwconf->ioaddr[i] = ioaddress + 8 * i;
@@ -544,7 +535,7 @@
 
if (retval != 0)
printk("Found MOXA %s board (CAP=0x%x)\n",
-  mxser_brdname[hwconf.board_type - 1],
+  mxser_brdname[hwconf.board_type],
   ioaddr[b]);
 
if (retval <= 0) {
@@ -579,7 +570,7 @@
 
if (retval != 0)
printk("Found MOXA %s board (CAP=0x%x)\n",
-  mxser_brdname[hwconf.board_type - 1],
+  mxser_brdname[hwconf.board_type],
   ioaddr[b]);
 
if (retval <= 0) {
@@ -612,21 +603,15 @@
 
n = sizeof(mxser_pcibrds) / sizeof(mxser_pciinfo);
index = 0;
-   b = 0;
-   while (b < n) {
+   for (b = 0; b < n; b++) {
pdev = pci_find_device(mxser_pcibrds[b].vendor_id,
   mxser_pcibrds[b].device_id, pdev);
-   if (!pdev)
-   {
-   b++;
-   continue;
-   }
-   if (pci_enable_device(pdev))
+   if (!pdev || pci_enable_device(pdev))
continue;
hwconf.pdev = pdev;
printk("Found MOXA %s board(BusNo=%d,DevNo=%d)\n",
-   mxser_brdname[mxser_pcibrds[b].board_type - 1],
-   pdev->bus->number, PCI_SLOT(pdev->devfn >> 3));
+   mxser_brdname[mxser_pcibrds[b].board_type],
+   pdev->bus->number, PCI_SLOT(pdev->devfn));
if (m >= MXSER_BOARDS) {
printk("Too many Smartio family boards found (maximum 
%d),board not configured\n", MXSER_BOARDS);
} else {
@@ -1352,7 +1337,7 @@
return;
if (port == 0)
return;
-   max = mxser_numports[mxsercfg[i].board_type - 1];
+   max = mxser_numports[mxsercfg[i].board_type];
 
while (1) {
irqbits = inb(port->vector) & port->vectormask;



PATCH: sym53c8xxx.c timer handling

2001-05-16 Thread Tim Hockin

Gerard, LKML

The attached patch tweaks the sym53c8xx timeout timer handling a bit.

Is it correct?

It works with interrupts off (we need a reboot notifier to iterate the
hosts list with IRQs off) and doesn't delay() for 5 seconds in the worst
case.

Is there any reason this patch won't work?  It seems OK to me.

Tim

-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]

--- /home/users/admin/C26/III/linux-2.2/drivers/scsi/sym53c8xx.cTue May 15 
19:14:48 2001
+++ /usr/src/linux/drivers/scsi/sym53c8xx.c Wed May 16 22:52:12 2001
@@ -2265,7 +2265,6 @@
**
*/
struct usrcmd   user;   /* Command from user*/
-   volatile u_char release_stage;  /* Synchronisation stage on release  */
 
/*
**  Fields that are used (primarily) for integrity check
@@ -5884,7 +5883,12 @@
**  start the timeout daemon
*/
np->lasttime=0;
-   ncr_timeout (np);
+#ifdef SCSI_NCR_PCIQ_BROKEN_INTR
+   np->timer.expires = ktime_get((HZ+9)/10);
+#else
+   np->timer.expires = ktime_get(SCSI_NCR_TIMER_INTERVAL);
+#endif
+   add_timer(&np->timer);
 
/*
**  use SIMPLE TAG messages by default
@@ -7244,20 +7248,17 @@
 
 static int ncr_detach(ncb_p np)
 {
-   int i;
+   unsigned long flags;
 
printk("%s: detaching ...\n", ncr_name(np));
 
 /*
-** Stop the ncr_timeout process
-** Set release_stage to 1 and wait that ncr_timeout() set it to 2.
+** Stop the ncr_timeout process - lock it to ensure no timer is running
+** on a different CPU, or anything
 */
-   np->release_stage = 1;
-   for (i = 50 ; i && np->release_stage != 2 ; i--) MDELAY (100);
-   if (np->release_stage != 2)
-   printk("%s: the timer seems to be already stopped\n",
-   ncr_name(np));
-   else np->release_stage = 2;
+   NCR_LOCK_NCB(np, flags);
+   del_timer(&np->timer);
+   NCR_UNLOCK_NCB(np, flags);
 
 /*
 ** Reset NCR chip.
@@ -8613,23 +8614,11 @@
 {
u_long  thistime = ktime_get(0);
 
-   /*
-   **  If release process in progress, let's go
-   **  Set the release stage from 1 to 2 to synchronize
-   **  with the release process.
-   */
-
-   if (np->release_stage) {
-   if (np->release_stage == 1) np->release_stage = 2;
-   return;
-   }
-
 #ifdef SCSI_NCR_PCIQ_BROKEN_INTR
-   np->timer.expires = ktime_get((HZ+9)/10);
+   mod_timer(&np->timer, ktime_get((HZ+9)/10));
 #else
-   np->timer.expires = ktime_get(SCSI_NCR_TIMER_INTERVAL);
+   mod_timer(&np->timer, ktime_get(SCSI_NCR_TIMER_INTERVAL));
 #endif
-   add_timer(&np->timer);
 
/*
**  If we are resetting the ncr, wait for settle_time before 



Re: Linux-2.4.4 failure to compile

2001-05-17 Thread Tim Hockin

"Richard B. Johnson" wrote:
> 
> Hello;
> 
> I downloaded linux-2.4.4. The basic kernel compiles but the aic7xxx
> SCSI module that I require on some machines, doesn't.

The aic7xxx assembler requiring libdb1 is a bungle.  Getting the headers
for that right on various distros is not easy.  Add to that it requires
YACC, when most people have bison (yes, a shell script is easy to make, but
not always an option). 


-- 
Tim Hockin
Systems Software Engineer
Sun Microsystems, Cobalt Server Appliances
[EMAIL PROTECTED]



Re: [x86_64 MCE] [RFC] mce.c race condition (or: when evil hacks are the only options)

2007-07-12 Thread Tim Hockin

On 7/12/07, Andi Kleen <[EMAIL PROTECTED]> wrote:


> -- there may be other edge cases other than
> this one. I'm actually surprised that this wasn't a ring buffer to start
> with -- it certainly seems like it wanted to be one.

The problem with a ring buffer is that it would lose old entries; but
for machine checks you really want the first entries because
the later ones might be just junk.


Couldn't the ring just have logic to detect an overrun and stop
logging until that is alleviated?  Similar to what is done now.  Maybe
I am underestimating it.


[PATCH] x86_64: mcelog tolerant level cleanup

2007-05-16 Thread Tim Hockin
From: Tim Hockin <[EMAIL PROTECTED]>

Background:
 The MCE handler has several paths that it can take, depending on various
 conditions of the MCE status and the value of the 'tolerant' knob.  The
 exact semantics are not well defined and the code is a bit twisty.

Description:
 This patch makes the MCE handler's behavior more clear by documenting the
 behavior for various 'tolerant' levels.  It also fixes or enhances
 several small things in the handler.  Specifically:
 * If RIPV is not set it is not safe to restart, so set the 'no way out'
   flag rather than the 'kill it' flag.
 * Don't panic() on correctable MCEs.
 * If the _OVER bit is set *and* the _UC bit is set (meaning possibly
   dropped uncorrected errors), set the 'no way out' flag.
 * Use EIPV for testing whether an app can be killed (SIGBUS) rather
   than RIPV.  According to docs, EIPV indicates that the error is
   related to the IP, while RIPV simply means the IP is valid to
   restart from.
 * Don't clear the MCi_STATUS registers until after the panic() path.
   This leaves the status bits set after the panic() so clever BIOSes
   can find them (and dumb BIOSes can do nothing).

 This patch also calls nonseekable_open() in mce_open (as suggested by akpm).

Result:
 Tolerant levels behave almost identically to how they always have, but
 now it's well defined.  There's a slightly higher chance of panic()ing
 when multiple errors happen (a good thing, IMHO).  If you take an MBE and
 panic(), the error status bits are not cleared.

Alternatives:
 None.

Testing:
 I used software to inject correctable and uncorrectable errors.  With
 tolerant = 3, the system usually survives.  With tolerant = 2, the system
 usually panic()s (PCC) but not always.  With tolerant = 1, the system
 always panic()s.  When the system panic()s, the BIOS is able to detect
 that the cause of death was an MC4.  I was not able to reproduce the
 case of a non-PCC error in userspace, with EIPV, with (tolerant < 3).
 That will be rare at best.

Patch:
 This patch is against 2.6.21-mm.

Signed-off-by: Tim Hockin <[EMAIL PROTECTED]>

---

This is the first version of this patch.


diff -pruN linux-2.6.21+01+02+03/Documentation/x86_64/boot-options.txt 
linux-2.6.21+04/Documentation/x86_64/boot-options.txt
--- linux-2.6.21+01+02+03/Documentation/x86_64/boot-options.txt 2007-04-25 
20:08:32.0 -0700
+++ linux-2.6.21+04/Documentation/x86_64/boot-options.txt   2007-05-09 
23:29:01.0 -0700
@@ -14,9 +14,11 @@ Machine check
mce=nobootlog
Disable boot machine check logging.
mce=tolerancelevel (number)
-   0: always panic, 1: panic if deadlock possible,
-   2: try to avoid panic, 3: never panic or exit (for testing)
-   default is 1
+   0: always panic on uncorrected errors, log corrected errors
+   1: panic or SIGBUS on uncorrected errors, log corrected errors
+   2: SIGBUS or log uncorrected errors, log corrected errors
+   3: never panic or SIGBUS, log all errors (for testing only)
+   Default is 1
Can be also set using sysfs which is preferable.
 
nomce (for compatibility with i386): same as mce=off
diff -pruN linux-2.6.21+01+02+03/Documentation/x86_64/machinecheck 
linux-2.6.21+04/Documentation/x86_64/machinecheck
--- linux-2.6.21+01+02+03/Documentation/x86_64/machinecheck 2007-05-07 
12:08:26.0 -0700
+++ linux-2.6.21+04/Documentation/x86_64/machinecheck   2007-05-09 
23:29:16.0 -0700
@@ -49,12 +49,14 @@ tolerant
Since machine check exceptions can happen any time it is sometimes
risky for the kernel to kill a process because it defies
normal kernel locking rules. The tolerance level configures
-   how hard the kernel tries to recover even at some risk of deadlock.
-
-   0: always panic,
-   1: panic if deadlock possible,
-   2: try to avoid panic,
-   3: never panic or exit (for testing only)
+   how hard the kernel tries to recover even at some risk of
+   deadlock.  Higher tolerant values trade potentially better uptime
+   with the risk of a crash or even corruption (for tolerant >= 3).
+
+   0: always panic on uncorrected errors, log corrected errors
+   1: panic or SIGBUS on uncorrected errors, log corrected errors
+   2: SIGBUS or log uncorrected errors, log corrected errors
+   3: never panic or SIGBUS, log all errors (for testing only)
 
Default: 1
 
diff -pruN linux-2.6.21+01+02+03/arch/x86_64/kernel/mce.c 
linux-2.6.21+04/arch/x86_64/kernel/mce.c
--- linux-2.6.21+01+02+03/arch/x86_64/kernel/mce.c  2007-05-09 
22:05:48.0 -0700
+++ linux-2.6.21+04/arch/x86_64/kernel/mce.c2007-05-11 21:02:12.0 
-0700
@@ -37,8 +37,13 @@ atomic_t mce_entry;
 
 static int mce_

[PATCH] x86_64: mce poll at IDLE_START and printk fix

2007-05-17 Thread Tim Hockin
From: Tim Hockin <[EMAIL PROTECTED]>

Background:
 The MCE handler already has an idle-task handler which checks for the
 TIF_MCE_NOTIFY flag.  Given that the system is idle at that point, we can
 get even better granularity of MCE logging by polling for MCEs whenever
 we enter the idle loop.  This exposes a small imperfection in the
 printk() rate limiting whereby that last "Events Logged" message might
 not get printed if no more MCEs arrive.

Description:
 This patch extends the MCE idle notifier callback to poll for MCEs on the
 current CPU at IDLE_START time.  It also adds one new static variable to
 track whether any events have been logged since the last printk() and
 causes a printk at the next rate-limited opportunity.

Result:
 MCEs are found more rapidly on systems with bad memory.

Alternatives:
 None.

Testing:
 I used software to inject correctable and uncorrectable errors.  An
 application poll()ing /dev/mcelog gets woken up very quickly after error
 injection.

Patch:
 This patch is against 2.6.21-mm.

Signed-off-by: Tim Hockin <[EMAIL PROTECTED]>

---

This is the first version of this patch.


diff -pruN linux-2.6.21+04_tolerant_cleanup/arch/x86_64/kernel/mce.c 
linux-2.6.21+05/arch/x86_64/kernel/mce.c
--- linux-2.6.21+04_tolerant_cleanup/arch/x86_64/kernel/mce.c   2007-05-11 
21:02:12.0 -0700
+++ linux-2.6.21+05/arch/x86_64/kernel/mce.c2007-05-17 15:29:00.0 
-0700
@@ -308,10 +308,10 @@ void do_machine_check(struct pt_regs * r
}
}
 
+ out:
/* notify userspace ASAP */
set_thread_flag(TIF_MCE_NOTIFY);
 
- out:
/* the last thing we do is clear state */
for (i = 0; i < banks; i++)
wrmsrl(MSR_IA32_MC0_STATUS+4*i, 0);
@@ -389,29 +389,43 @@ static void mcheck_timer(struct work_str
  */
 int mce_notify_user(void)
 {
+   static int do_printk;
+   int retval = 0;
+
clear_thread_flag(TIF_MCE_NOTIFY);
-   if (test_and_clear_bit(0, &notify_user)) {
-   static unsigned long last_print;
-   unsigned long now = jiffies;
 
+   /* notify userspace apps as soon as possible */
+   if (test_and_clear_bit(0, &notify_user)) {
wake_up_interruptible(&mce_wait);
if (trigger[0])
call_usermodehelper(trigger, trigger_argv, NULL, -1);
+   do_printk = 1;
+   retval = 1;
+   }
+
+   /* only log a message periodically */
+   if (do_printk) {
+   static unsigned long last_print;
+   unsigned long now = jiffies;
 
if (time_after_eq(now, last_print + (check_interval*HZ))) {
last_print = now;
printk(KERN_INFO "Machine check events logged\n");
+   do_printk = 0;
}
-
-   return 1;
}
-   return 0;
+
+   return retval;
 }
 
-/* see if the idle task needs to notify userspace */
+/* take advantage of idle time to manage MCEs */
 static int
 mce_idle_callback(struct notifier_block *nfb, unsigned long action, void *junk)
 {
+   /* poll for new MCEs on this CPU */
+   if (action == IDLE_START)
+   mcheck_check_cpu(NULL);
+
/* IDLE_END should be safe - interrupts are back on */
if (action == IDLE_END && test_thread_flag(TIF_MCE_NOTIFY))
mce_notify_user();
-


Re: Kconfig variable "COBALT" is not defined anywhere

2007-06-03 Thread Tim Hockin

There were other patches which added more COBALT support, but they
were dropped or lost or whatever.

I would not balk at having that code yanked.  I never got around to
doing proper Cobalt support for modern kernels. :(

On 6/3/07, Roland Dreier <[EMAIL PROTECTED]> wrote:

 > > >   there is no Kconfig file which defines the selectable option
 > > > "COBALT", which means that this snippet from drivers/char/nvram.c:
 > > >
 > > > #  if defined(CONFIG_COBALT)
 > > > #include 
 > > > #define MACH COBALT
 > > > #  else
 > > > #define MACH PC
 > > > #  endif
 > > > never evaluates to true, therefore making 
 > > > fairly useless, at least under the circumstances.

 > > Maybe it should be MIPS_COBALT ?

 > that's the first thing that occurred to me, but that header file is
 > copyright sun microsystems and says nothing about MIPS, so that didn't
 > really settle the issue.  that's why i'd rather someone else resolve
 > this one way or the other.

Actually, looking through the old kernel history, it looks like this
was added by Tim Hockin's (CCed) patch "Add Cobalt Networks support to
nvram driver".  Which added this to drivers/cobalt:

+bool 'Support for Cobalt Networks x86 servers' CONFIG_COBALT

I guess Tim can clear up what's intended...




Re: Kconfig variable "COBALT" is not defined anywhere

2007-06-03 Thread Tim Hockin

I think the nvram is the only place left that uses CONFIG_COBALT

On 6/3/07, Robert P. J. Day <[EMAIL PROTECTED]> wrote:

On Sun, 3 Jun 2007, Tim Hockin wrote:

> There were other patches which added more COBALT support, but they
> were dropped or lost or whatever.
>
> I would not balk at having that code yanked.  I never got around to
> doing proper Cobalt support for modern kernels. :(
>
> On 6/3/07, Roland Dreier <[EMAIL PROTECTED]> wrote:
> >  > > >   there is no Kconfig file which defines the selectable option
> >  > > > "COBALT", which means that this snippet from drivers/char/nvram.c:
> >  > > >
> >  > > > #  if defined(CONFIG_COBALT)
> >  > > > #include 
> >  > > > #define MACH COBALT
> >  > > > #  else
> >  > > > #define MACH PC
> >  > > > #  endif
> >  > > > never evaluates to true, therefore making 
> >  > > > fairly useless, at least under the circumstances.
> >
> >  > > Maybe it should be MIPS_COBALT ?
> >
> >  > that's the first thing that occurred to me, but that header file is
> >  > copyright sun microsystems and says nothing about MIPS, so that didn't
> >  > really settle the issue.  that's why i'd rather someone else resolve
> >  > this one way or the other.
> >
> > Actually, looking through the old kernel history, it looks like this
> > was added by Tim Hockin's (CCed) patch "Add Cobalt Networks support to
> > nvram driver".  Which added this to drivers/cobalt:
> >
> > +bool 'Support for Cobalt Networks x86 servers' CONFIG_COBALT
> >
> > I guess Tim can clear up what's intended...

ok, that sounds like it might be a bigger issue than just a dead
CONFIG variable.  if that's all it is, i can submit a patch.  if it's
more than that, i'll leave it to someone higher up the food chain to
figure out what cobalt-related stuff should be yanked.

rday
--

Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://fsdev.net/wiki/index.php?title=Main_Page





Re: Kconfig variable "COBALT" is not defined anywhere

2007-06-03 Thread Tim Hockin

That sounds correct.

On 6/3/07, Robert P. J. Day <[EMAIL PROTECTED]> wrote:

On Sun, 3 Jun 2007, Tim Hockin wrote:

> I think the nvram is the only place left that uses CONFIG_COBALT

sure, but once you remove this snippet near the top of
drivers/char/nvram.c:

...
#  if defined(CONFIG_COBALT)
#include 
#define MACH COBALT
#  else
#define MACH PC
#  endif
...

then everything else COBALT-related in that file should be tossed as
well, which would include stuff conditional on:

  #if MACH == COBALT

and so on.  just making sure that what you're saying is that *all*
COBALT-related content in that file can be thrown out.  i'll submit a
patch shortly and you can pass judgment.

rday
--

Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://fsdev.net/wiki/index.php?title=Main_Page



Re: [PATCH] x86_64: mcelog tolerant level cleanup

2007-05-18 Thread Tim Hockin

On 5/18/07, Andi Kleen <[EMAIL PROTECTED]> wrote:


>  * If RIPV is set it is not safe to restart, so set the 'no way out'
>flag rather than the 'kill it' flag.

Why? It is not PCC. We cannot return of course, but killing isn't returning.


My understanding is that the absence of RIPV indicates that it is not
safe to restart, period.  Not that the running *task* is not safe, but
that the IP on the stack is not valid to restart at all.


>  * Don't panic() on correctable MCEs.

The idea behind this was that if you get an exception it is always a bit risky
because there are a few potential deadlocks that cannot be avoided.
And normally non UC is just polled which will never cause a panic.
So I don't quite see the value of this change.


It will still always panic when tolerant == 0, and of course you're
right that correctable errors would skip over the panic() path anyway.  I
can roll back the "<0" part, though I don't see the difference now :)


> This patch also calls nonseekable_open() in mce_open (as suggested by akpm).

That should be a separate patch


Andrew already sucked it into -mm - do you want me to break it out,
and re-submit?


> + 0: always panic on uncorrected errors, log corrected errors
> + 1: panic or SIGBUS on uncorrected errors, log corrected errors
> + 2: SIGBUS or log uncorrected errors, log corrected errors

Just saying SIGBUS is misleading because it isn't a catchable
signal.


should I change that to "kill"?


Why did you remove the idle special case?


Because once the other tolerant rules are clarified, it's redundant
for tolerant < 2, and I think it's a bad special case for tolerant ==
2, and it's definitely wrong for tolerant == 3.

Shall I re-roll?


[PATCH] x86_64: O_EXCL on /dev/mcelog (resend)

2007-05-07 Thread Tim Hockin
From: Tim Hockin <[EMAIL PROTECTED]>

Background:
 /dev/mcelog is a clear-on-read interface.  It is currently possible for
 multiple users to open and read() the device.  Users are protected from
 each other during any one read, but not across reads.

Description:
 This patch adds support for O_EXCL to /dev/mcelog.  If a user opens the
 device with O_EXCL, no other user may open the device (EBUSY).  Likewise,
 any user that tries to open the device with O_EXCL while another user has
 the device will fail (EBUSY).

Result:
 Applications can get exclusive access to /dev/mcelog.  Applications that
 do not care will be unchanged.

Alternatives:
 A simpler choice would be to only allow one open() at all, regardless of
 O_EXCL.

Testing:
 I wrote an application that opens /dev/mcelog with O_EXCL and observed
 that any other app that tried to open /dev/mcelog would fail until the
 exclusive app had closed the device.

Caveats:
 None.

Patch:
 This patch is against 2.6.21.

Signed-off-by: Tim Hockin <[EMAIL PROTECTED]>

---

This is the first version of this patch.  The simpler alternative
of only one open() sounds better to me, but becomes a net change in
behavior.


diff -pruN linux-2.6.20+th/arch/x86_64/kernel/mce.c 
linux-2.6.20+th1.5/arch/x86_64/kernel/mce.c
--- linux-2.6.20+th/arch/x86_64/kernel/mce.c2007-04-27 14:19:08.0 
-0700
+++ linux-2.6.20+th1.5/arch/x86_64/kernel/mce.c 2007-05-01 21:53:10.0 
-0700
@@ -465,6 +465,40 @@ void __cpuinit mcheck_init(struct cpuinf
  * Character device to read and clear the MCE log.
  */
 
+static DEFINE_SPINLOCK(mce_state_lock);
+static int open_count; /* #times opened */
+static int open_exclu; /* already open exclusive? */
+
+static int mce_open(struct inode *inode, struct file *file)
+{
+   spin_lock(&mce_state_lock);
+
+   if (open_exclu || (open_count && (file->f_flags & O_EXCL))) {
+   spin_unlock(&mce_state_lock);
+   return -EBUSY;
+   }
+
+   if (file->f_flags & O_EXCL)
+   open_exclu = 1;
+   open_count++;
+
+   spin_unlock(&mce_state_lock);
+
+   return 0;
+}
+
+static int mce_release(struct inode *inode, struct file *file)
+{
+   spin_lock(&mce_state_lock);
+
+   open_count--;
+   open_exclu = 0;
+
+   spin_unlock(&mce_state_lock);
+
+   return 0;
+}
+
 static void collect_tscs(void *data) 
 { 
unsigned long *cpu_tsc = (unsigned long *)data;
@@ -553,6 +587,8 @@ static int mce_ioctl(struct inode *i, st
 }
 
 static const struct file_operations mce_chrdev_ops = {
+   .open = mce_open,
+   .release = mce_release,
.read = mce_read,
.ioctl = mce_ioctl,
 };


[PATCH] x86_64: support poll() on /dev/mcelog (try #3) (resend)

2007-05-07 Thread Tim Hockin
From: Tim Hockin <[EMAIL PROTECTED]>

Background:
 /dev/mcelog is typically polled manually.  This is less than optimal for
 situations where accurate accounting of MCEs is important.  Calling
 poll() on /dev/mcelog does not work.

Description:
 This patch adds support for poll() to /dev/mcelog.  This results in
 immediate wakeup of user apps whenever the poller finds MCEs.  Because
 the exception handler can not take any locks, it can not call the wakeup
 itself.  Instead, it uses a thread_info flag (TIF_MCE_NOTIFY) which is
 caught at the next return from interrupt or exit from idle, calling the
 mce_user_notify() routine.  This patch also disables the "fake panic"
 path of the mce_panic(), because it results in printk()s in the exception
 handler and crashy systems.

 This patch also does some small cleanup for essentially unused variables,
 and moves the user notification into the body of the poller, so it is
 only called once per poll, rather than once per CPU.

Result:
 Applications can now poll() on /dev/mcelog.  When an error is logged
 (whether through the poller or through an exception) the applications are
 woken up promptly.  This should not affect any previous behaviors.  If no
 MCEs are being logged, there is no overhead.

Alternatives:
 I considered simply supporting poll() through the poller and not using
 TIF_MCE_NOTIFY at all.  However, the time between an uncorrectable error
 happening and the user application being notified is *the* most critical
 window for us.  Many uncorrectable errors can be logged to the network if
 given a chance.

 I also considered doing the MCE poll directly from the idle notifier, but
 decided that was overkill.

Testing:
 I used an error-injecting DIMM to create lots of correctable DRAM errors
 and verified that my user app is woken up in sync with the polling interval.
 I also used the northbridge to inject uncorrectable ECC errors, and
 verified (printk() to the rescue) that the notify routine is called and the
 user app does wake up.  I built with PREEMPT on and off, and verified
 that my machine survives MCEs.

Patch:
 This patch is against 2.6.21.

Signed-off-by: Tim Hockin <[EMAIL PROTECTED]>

---

This is the third version of this patch.  The TIF_* approach was
suggested by Mike Waychison and Andi did not yell at me when I suggested
it.  Hooking the idle notifier was born of an Andrew Morton suggestion
and, no surprise, seems to work well.


diff -pruN linux-2.6.20+01_poll_interval/arch/x86_64/kernel/entry.S 
linux-2.6.20+03_poll/arch/x86_64/kernel/entry.S
--- linux-2.6.20+01_poll_interval/arch/x86_64/kernel/entry.S2007-04-24 
22:46:19.0 -0700
+++ linux-2.6.20+03_poll/arch/x86_64/kernel/entry.S 2007-05-02 
20:50:38.0 -0700
@@ -282,7 +282,7 @@ sysret_careful:
 sysret_signal:
TRACE_IRQS_ON
sti
-   testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
+   testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz  1f
 
/* Really a signal */
@@ -375,7 +375,7 @@ int_very_careful:
jmp int_restore_rest

 int_signal:
-   testl $(_TIF_NOTIFY_RESUME|_TIF_SIGPENDING|_TIF_SINGLESTEP),%edx
+   testl $(_TIF_NOTIFY_RESUME|_TIF_SIGPENDING|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz 1f
movq %rsp,%rdi  # &ptregs -> arg1
xorl %esi,%esi  # oldset -> arg2
@@ -599,7 +599,7 @@ retint_careful:
jmp retint_check

 retint_signal:
-   testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
+   testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz  retint_swapgs
TRACE_IRQS_ON
sti
diff -pruN linux-2.6.20+01_poll_interval/arch/x86_64/kernel/mce.c 
linux-2.6.20+03_poll/arch/x86_64/kernel/mce.c
--- linux-2.6.20+01_poll_interval/arch/x86_64/kernel/mce.c  2007-04-27 
14:19:08.0 -0700
+++ linux-2.6.20+03_poll/arch/x86_64/kernel/mce.c   2007-05-02 
21:02:16.0 -0700
@@ -20,12 +20,15 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include  
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #define MISC_MCELOG_MINOR 227
 #define NR_BANKS 6
@@ -39,8 +42,7 @@ static int mce_dont_init;
 static int tolerant = 1;
 static int banks;
 static unsigned long bank[NR_BANKS] = { [0 ... NR_BANKS-1] = ~0UL };
-static unsigned long console_logged;
-static int notify_user;
+static unsigned long notify_user;
 static int rip_msr;
 static int mce_bootlog = 1;
 static atomic_t mce_events;
@@ -48,6 +50,8 @@ static atomic_t mce_events;
 static char trigger[128];
 static char *trigger_argv[2] = { trigger, NULL };
 
+static DECLARE_WAIT_QUEUE_HEAD(mce_wait);
+
 /*
  * Lockless MCE logging infrastructure.
  * This avoids deadlocks on printk locks without having to break locks. Also
@@ -94,8 +98,7 @@ void mce_log(struct mce *mce)
mcelog.entry[entry].finished 

[PATCH] x86_64: dynamic MCE poll interval

2007-04-26 Thread Tim Hockin

From: Tim Hockin <[EMAIL PROTECTED]>

Background:
 We've found that MCEs (specifically DRAM SBEs) tend to come in bunches,
 especially when we are trying really hard to stress the system out.  The
 current MCE poller uses a static interval which does not care whether it
 has or has not found MCEs recently.

Description:
 This patch makes the MCE poller adjust the polling interval dynamically.
 If we find an MCE, poll 2x faster (down to 10 ms).  When we stop finding
 MCEs, poll 2x slower (up to check_interval seconds).  The check_interval
 tunable becomes the max polling interval.

Result:
 If you start to take a lot of correctable errors (not exceptions), you
 log them faster and more accurately (less chance of overflowing the MCA
 registers).  If you don't take a lot of errors, you will see no change.

Alternatives:
 I considered simply reducing the polling interval to 10 ms immediately
 and keeping it there as long as we continue to find errors.  This felt a
 bit heavy handed, but does perform significantly better for the default
 check_interval of 5 minutes (we're using a few seconds when testing for
 DRAM errors).

Patch:
 This patch is against 2.6.21-rc7.

Signed-Off-By: Tim Hockin <[EMAIL PROTECTED]>

---

diff -pruN linux-2.6.20/arch/x86_64/kernel/mce.c
linux-2.6.20+th/arch/x86_64/kernel/mce.c
--- linux-2.6.20/arch/x86_64/kernel/mce.c   2007-04-24 23:36:04.0 
-0700
+++ linux-2.6.20+th/arch/x86_64/kernel/mce.c2007-04-26 10:40:29.0 
-0700
@@ -327,6 +327,7 @@ void mce_log_therm_throt_event(unsigned
 */

static int check_interval = 5 * 60; /* 5 minutes */
+static int next_interval; /* in jiffies */
static void mcheck_timer(struct work_struct *work);
static DECLARE_DELAYED_WORK(mcheck_work, mcheck_timer);

@@ -339,7 +340,6 @@ static void mcheck_check_cpu(void *info)
static void mcheck_timer(struct work_struct *work)
{
on_each_cpu(mcheck_check_cpu, NULL, 1, 1);
-   schedule_delayed_work(&mcheck_work, check_interval * HZ);

/*
 * It's ok to read stale data here for notify_user and
@@ -349,17 +349,24 @@ static void mcheck_timer(struct work_str
 * writes.
 */
if (notify_user && console_logged) {
+   /* if we logged an MCE, reduce the polling interval */
+   next_interval = max(next_interval/2, HZ/100);
notify_user = 0;
clear_bit(0, &console_logged);
printk(KERN_INFO "Machine check events logged\n");
+   } else {
+   next_interval = min(next_interval*2, check_interval*HZ);
}
+
+   schedule_delayed_work(&mcheck_work, next_interval);
}


static __init int periodic_mcheck_init(void)
{
-   if (check_interval)
-   schedule_delayed_work(&mcheck_work, check_interval*HZ);
+   next_interval = check_interval * HZ;
+   if (next_interval)
+   schedule_delayed_work(&mcheck_work, next_interval);
return 0;
}
__initcall(periodic_mcheck_init);
@@ -597,12 +604,13 @@ static int mce_resume(struct sys_device
/* Reinit MCEs after user configuration changes */
static void mce_restart(void)
{
-   if (check_interval)
+   if (next_interval)
cancel_delayed_work(&mcheck_work);
/* Timer race is harmless here */
on_each_cpu(mce_init, NULL, 1, 1);
-   if (check_interval)
-   schedule_delayed_work(&mcheck_work, check_interval*HZ);
+   next_interval = check_interval * HZ;
+   if (next_interval)
+   schedule_delayed_work(&mcheck_work, next_interval);
}

static struct sysdev_class mce_sysclass = {


Re: [PATCH] x86_64: dynamic MCE poll interval

2007-04-27 Thread Tim Hockin

On 27 Apr 2007 11:09:17 +0200, Andi Kleen <[EMAIL PROTECTED]> wrote:

On Thu, Apr 26, 2007 at 06:02:52PM -0700, Tim Hockin wrote:
> Description:
>  This patch makes the MCE poller adjust the polling interval dynamically.
>  If we find an MCE, poll 2x faster (down to 10 ms).  When we stop finding
>  MCEs, poll 2x slower (up to check_interval seconds).  The check_interval
>  tunable becomes the max polling interval.

Can you please fix the documentation then?


Which documentation, specifically? :)


> Result:
>  If you start to take a lot of correctable errors (not exceptions), you
>  log them faster and more accurately (less chance of overflowing the MCA
>  registers).  If you don't take a lot of errors, you will see no change.

Makes sense.

AMD RevF can do this using the threshold interrupts too for DIMM errors
too without any delays -- perhaps it would also make sense to configure
this by default that it always triggers on all DIMM errors.
Right now it is just an option in /sys


Can I look at this as a followon patch?  I have a number of mce
related patches in the pipeline, and I am trying to keep them small
for testing sanity - they are hard enough to test :)


The printk should not happen too often. Can you add some hardcoded
limit there so that it doesn't happen more often than every hour or so
(or perhaps use an exponential backoff here too?)
It is only to tell users to check mcelog output.


Sure.  I'll fix it up and hit you again today.


Re: [PATCH] x86_64: dynamic MCE poll interval

2007-04-27 Thread Tim Hockin

On 27 Apr 2007 19:02:30 +0200, Andi Kleen <[EMAIL PROTECTED]> wrote:

On Fri, Apr 27, 2007 at 09:58:14AM -0700, Tim Hockin wrote:
> On 27 Apr 2007 11:09:17 +0200, Andi Kleen <[EMAIL PROTECTED]> wrote:
> >On Thu, Apr 26, 2007 at 06:02:52PM -0700, Tim Hockin wrote:
> >> Description:
> >>  This patch makes the MCE poller adjust the polling interval dynamically.
> >>  If we find an MCE, poll 2x faster (down to 10 ms).  When we stop finding
> >>  MCEs, poll 2x slower (up to check_interval seconds).  The check_interval
> >>  tunable becomes the max polling interval.
> >
> >Can you please fix the documentation then?
>
> Which documentation, specifically? :)

Documentation/x86_64/{boot-options.txt,machinecheck}


What needs to be changed in boot-options?



>
> >> Result:
> >>  If you start to take a lot of correctable errors (not exceptions), you
> >>  log them faster and more accurately (less chance of overflowing the MCA
> >>  registers).  If you don't take a lot of errors, you will see no change.
> >
> >Makes sense.
> >
> >AMD RevF can do this using the threshold interrupts too for DIMM errors
> >too without any delays -- perhaps it would also make sense to configure
> >this by default that it always triggers on all DIMM errors.
> >Right now it is just an option in /sys
>
> Can I look at this as a followon patch?  I have a number of mce

Sure.

-Andi




Re: [PATCH] x86_64: dynamic MCE poll interval

2007-04-27 Thread Tim Hockin

From: Tim Hockin <[EMAIL PROTECTED]>

Background:
We've found that MCEs (specifically DRAM SBEs) tend to come in bunches,
especially when we are trying really hard to stress the system out.  The
current MCE poller uses a static interval which does not care whether it
has or has not found MCEs recently.

Description:
This patch makes the MCE poller adjust the polling interval dynamically.
If we find an MCE, poll 2x faster (down to 10 ms).  When we stop finding
MCEs, poll 2x slower (up to check_interval seconds).  The check_interval
tunable becomes the max polling interval.  The "Machine check events
logged" printk() is rate limited to the check_interval, which should be
identical behavior to the old functionality.

Result:
If you start to take a lot of correctable errors (not exceptions), you
log them faster and more accurately (less chance of overflowing the MCA
registers).  If you don't take a lot of errors, you will see no change.

Alternatives:
I considered simply reducing the polling interval to 10 ms immediately
and keeping it there as long as we continue to find errors.  This felt a
bit heavy handed, but does perform significantly better for the default
check_interval of 5 minutes (we're using a few seconds when testing for
DRAM errors).  I could be convinced to go with this, if anyone felt it
was not too aggressive.

Testing:
I used an error-injecting DIMM to create lots of correctable DRAM errors
and verified that the polling interval accelerates.  The printk() only
happens once per check_interval seconds.

Patch:
This patch is against 2.6.21-rc7.

Signed-Off-By: Tim Hockin <[EMAIL PROTECTED]>

---

This is the second version of this patch.  The more I think about the
alternative described above, the more I like it, but I recognize that
it is heavy-handed.  If correctable errors matter that much to you,
you can start with a small check_interval, I suppose.


diff -pruN linux-2.6.20/Documentation/x86_64/machinecheck
linux-2.6.20+th/Documentation/x86_64/machinecheck
--- linux-2.6.20/Documentation/x86_64/machinecheck  2007-04-24
23:36:03.0 -0700
+++ linux-2.6.20+th/Documentation/x86_64/machinecheck   2007-04-27
10:11:10.0 -0700
@@ -36,7 +36,12 @@ between all CPUs.

check_interval
How often to poll for corrected machine check errors, in seconds
-   (Note output is hexademical). Default 5 minutes.
+   (Note output is hexademical). Default 5 minutes.  When the poller
+   finds MCEs it triggers an exponential speedup (poll more often) on
+   the polling interval.  When the poller stops finding MCEs, it
+   triggers an exponential backoff (poll less often) on the polling
+   interval. The check_interval variable is both the initial and
+   maximum polling interval.

tolerant
Tolerance level. When a machine check exception occurs for a non
diff -pruN linux-2.6.20/arch/x86_64/kernel/mce.c
linux-2.6.20+th/arch/x86_64/kernel/mce.c
--- linux-2.6.20/arch/x86_64/kernel/mce.c   2007-04-27 10:01:02.0 
-0700
+++ linux-2.6.20+th/arch/x86_64/kernel/mce.c2007-04-27 10:41:02.0 
-0700
@@ -323,10 +323,13 @@ void mce_log_therm_throt_event(unsigned
#endif /* CONFIG_X86_MCE_INTEL */

/*
- * Periodic polling timer for "silent" machine check errors.
+ * Periodic polling timer for "silent" machine check errors.  If the
+ * poller finds an MCE, poll 2x faster.  When the poller finds no more
+ * errors, poll 2x slower (up to check_interval seconds).
 */

static int check_interval = 5 * 60; /* 5 minutes */
+static int next_interval; /* in jiffies */
static void mcheck_timer(struct work_struct *work);
static DECLARE_DELAYED_WORK(mcheck_work, mcheck_timer);

@@ -339,7 +342,6 @@ static void mcheck_check_cpu(void *info)
static void mcheck_timer(struct work_struct *work)
{
on_each_cpu(mcheck_check_cpu, NULL, 1, 1);
-   schedule_delayed_work(&mcheck_work, check_interval * HZ);

/*
 * It's ok to read stale data here for notify_user and
@@ -349,17 +351,30 @@ static void mcheck_timer(struct work_str
 * writes.
 */
if (notify_user && console_logged) {
+   static unsigned long last_print = 0;
+   unsigned long now = jiffies;
+
+   /* if we logged an MCE, reduce the polling interval */
+   next_interval = max(next_interval/2, HZ/100);
notify_user = 0;
clear_bit(0, &console_logged);
-   printk(KERN_INFO "Machine check events logged\n");
+   if ((now - last_print) >= check_interval*HZ) {
+   last_print = now;
+   printk(KERN_INFO "Machine check events logged\n");
+   }
+   } else {
+   next_interval = min(next_interval*2, check_interval*HZ);
}
+
+   schedule_delayed_work(&mcheck_work, next_interval);
}


static 

Re: [PATCH] x86_64: dynamic MCE poll interval

2007-04-27 Thread Tim Hockin

From: Tim Hockin <[EMAIL PROTECTED]>

Background:
We've found that MCEs (specifically DRAM SBEs) tend to come in bunches,
especially when we are trying really hard to stress the system out.  The
current MCE poller uses a static interval which does not care whether it
has or has not found MCEs recently.

Description:
This patch makes the MCE poller adjust the polling interval dynamically.
If we find an MCE, poll 2x faster (down to 10 ms).  When we stop finding
MCEs, poll 2x slower (up to check_interval seconds).  The check_interval
tunable becomes the max polling interval.  The "Machine check events
logged" printk() is rate limited to the check_interval, which should be
identical behavior to the old functionality.

Result:
If you start to take a lot of correctable errors (not exceptions), you
log them faster and more accurately (less chance of overflowing the MCA
registers).  If you don't take a lot of errors, you will see no change.

Alternatives:
I considered simply reducing the polling interval to 10 ms immediately
and keeping it there as long as we continue to find errors.  This felt a
bit heavy handed, but does perform significantly better for the default
check_interval of 5 minutes (we're using a few seconds when testing for
DRAM errors).  I could be convinced to go with this, if anyone felt it
was not too aggressive.

Testing:
I used an error-injecting DIMM to create lots of correctable DRAM errors
and verified that the polling interval accelerates.  The printk() only
happens once per check_interval seconds.

Patch:
This patch is against 2.6.21-rc7.

Signed-Off-By: Tim Hockin <[EMAIL PROTECTED]>

---

This is the third version of this patch.  The only change from the prior
version is to use time_after_eq().

diff -pruN linux-2.6.20/Documentation/x86_64/machinecheck
linux-2.6.20+th/Documentation/x86_64/machinecheck
--- linux-2.6.20/Documentation/x86_64/machinecheck  2007-04-24
23:36:03.0 -0700
+++ linux-2.6.20+th/Documentation/x86_64/machinecheck   2007-04-27
10:11:10.0 -0700
@@ -36,7 +36,12 @@ between all CPUs.

check_interval
How often to poll for corrected machine check errors, in seconds
-   (Note output is hexademical). Default 5 minutes.
+   (Note output is hexademical). Default 5 minutes.  When the poller
+   finds MCEs it triggers an exponential speedup (poll more often) on
+   the polling interval.  When the poller stops finding MCEs, it
+   triggers an exponential backoff (poll less often) on the polling
+   interval. The check_interval variable is both the initial and
+   maximum polling interval.

tolerant
Tolerance level. When a machine check exception occurs for a non
diff -pruN linux-2.6.20/arch/x86_64/kernel/mce.c
linux-2.6.20+th/arch/x86_64/kernel/mce.c
--- linux-2.6.20/arch/x86_64/kernel/mce.c   2007-04-27 10:01:02.0 
-0700
+++ linux-2.6.20+th/arch/x86_64/kernel/mce.c2007-04-27 10:41:02.0 
-0700
@@ -323,10 +323,13 @@ void mce_log_therm_throt_event(unsigned
#endif /* CONFIG_X86_MCE_INTEL */

/*
- * Periodic polling timer for "silent" machine check errors.
+ * Periodic polling timer for "silent" machine check errors.  If the
+ * poller finds an MCE, poll 2x faster.  When the poller finds no more
+ * errors, poll 2x slower (up to check_interval seconds).
 */

static int check_interval = 5 * 60; /* 5 minutes */
+static int next_interval; /* in jiffies */
static void mcheck_timer(struct work_struct *work);
static DECLARE_DELAYED_WORK(mcheck_work, mcheck_timer);

@@ -339,7 +342,6 @@ static void mcheck_check_cpu(void *info)
static void mcheck_timer(struct work_struct *work)
{
on_each_cpu(mcheck_check_cpu, NULL, 1, 1);
-   schedule_delayed_work(&mcheck_work, check_interval * HZ);

/*
 * It's ok to read stale data here for notify_user and
@@ -349,17 +351,30 @@ static void mcheck_timer(struct work_str
 * writes.
 */
if (notify_user && console_logged) {
+   static unsigned long last_print = 0;
+   unsigned long now = jiffies;
+
+   /* if we logged an MCE, reduce the polling interval */
+   next_interval = max(next_interval/2, HZ/100);
notify_user = 0;
clear_bit(0, &console_logged);
-   printk(KERN_INFO "Machine check events logged\n");
+   if (time_after_eq(now, last_print + (check_interval*HZ))) {
+   last_print = now;
+   printk(KERN_INFO "Machine check events logged\n");
+   }
+   } else {
+   next_interval = min(next_interval*2, check_interval*HZ);
}
+
+   schedule_delayed_work(&mcheck_work, next_interval);
}


static __init int periodic_mcheck_init(void)
{
-   if (check_interval)
-   schedule_delayed_work(&mcheck_work, check_interval*HZ);
+  

Re: [PATCH] x86_64: dynamic MCE poll interval

2007-04-27 Thread Tim Hockin

Sorry, Gmail mangles whitespace unless you do just the right thing.
Let me work around it.  Proper patch coming.


On 4/27/07, Tim Hockin <[EMAIL PROTECTED]> wrote:

From: Tim Hockin <[EMAIL PROTECTED]>

Background:
 We've found that MCEs (specifically DRAM SBEs) tend to come in bunches,
 especially when we are trying really hard to stress the system out.  The
 current MCE poller uses a static interval which does not care whether it
 has or has not found MCEs recently.

Description:
 This patch makes the MCE poller adjust the polling interval dynamically.
 If we find an MCE, poll 2x faster (down to 10 ms).  When we stop finding
 MCEs, poll 2x slower (up to check_interval seconds).  The check_interval
 tunable becomes the max polling interval.  The "Machine check events
 logged" printk() is rate limited to the check_interval, which should be
 identical behavior to the old functionality.

Result:
 If you start to take a lot of correctable errors (not exceptions), you
 log them faster and more accurately (less chance of overflowing the MCA
 registers).  If you don't take a lot of errors, you will see no change.

Alternatives:
 I considered simply reducing the polling interval to 10 ms immediately
 and keeping it there as long as we continue to find errors.  This felt a
 bit heavy handed, but does perform significantly better for the default
 check_interval of 5 minutes (we're using a few seconds when testing for
 DRAM errors).  I could be convinced to go with this, if anyone felt it
 was not too aggressive.

Testing:
 I used an error-injecting DIMM to create lots of correctable DRAM errors
 and verified that the polling interval accelerates.  The printk() only
 happens once per check_interval seconds.

Patch:
 This patch is against 2.6.21-rc7.

Signed-Off-By: Tim Hockin <[EMAIL PROTECTED]>

---

This is the third version of this patch.  The only change from the prior
version is to use time_after_eq().

diff -pruN linux-2.6.20/Documentation/x86_64/machinecheck
linux-2.6.20+th/Documentation/x86_64/machinecheck
--- linux-2.6.20/Documentation/x86_64/machinecheck  2007-04-24
23:36:03.0 -0700
+++ linux-2.6.20+th/Documentation/x86_64/machinecheck   2007-04-27
10:11:10.0 -0700
@@ -36,7 +36,12 @@ between all CPUs.

 check_interval
How often to poll for corrected machine check errors, in seconds
-   (Note output is hexademical). Default 5 minutes.
+   (Note output is hexademical). Default 5 minutes.  When the poller
+   finds MCEs it triggers an exponential speedup (poll more often) on
+   the polling interval.  When the poller stops finding MCEs, it
+   triggers an exponential backoff (poll less often) on the polling
+   interval. The check_interval variable is both the initial and
+   maximum polling interval.

 tolerant
Tolerance level. When a machine check exception occurs for a non
diff -pruN linux-2.6.20/arch/x86_64/kernel/mce.c
linux-2.6.20+th/arch/x86_64/kernel/mce.c
--- linux-2.6.20/arch/x86_64/kernel/mce.c   2007-04-27 10:01:02.0 
-0700
+++ linux-2.6.20+th/arch/x86_64/kernel/mce.c2007-04-27 10:41:02.0 
-0700
@@ -323,10 +323,13 @@ void mce_log_therm_throt_event(unsigned
 #endif /* CONFIG_X86_MCE_INTEL */

 /*
- * Periodic polling timer for "silent" machine check errors.
+ * Periodic polling timer for "silent" machine check errors.  If the
+ * poller finds an MCE, poll 2x faster.  When the poller finds no more
+ * errors, poll 2x slower (up to check_interval seconds).
  */

 static int check_interval = 5 * 60; /* 5 minutes */
+static int next_interval; /* in jiffies */
 static void mcheck_timer(struct work_struct *work);
 static DECLARE_DELAYED_WORK(mcheck_work, mcheck_timer);

@@ -339,7 +342,6 @@ static void mcheck_check_cpu(void *info)
 static void mcheck_timer(struct work_struct *work)
 {
on_each_cpu(mcheck_check_cpu, NULL, 1, 1);
-   schedule_delayed_work(&mcheck_work, check_interval * HZ);

/*
 * It's ok to read stale data here for notify_user and
@@ -349,17 +351,30 @@ static void mcheck_timer(struct work_str
 * writes.
 */
if (notify_user && console_logged) {
+   static unsigned long last_print = 0;
+   unsigned long now = jiffies;
+
+   /* if we logged an MCE, reduce the polling interval */
+   next_interval = max(next_interval/2, HZ/100);
notify_user = 0;
clear_bit(0, &console_logged);
-   printk(KERN_INFO "Machine check events logged\n");
+   if (time_after_eq(now, last_print + (check_interval*HZ))) {
+   last_print = now;
+   printk(KERN_INFO "Machine check events logged\n");
+   }
+   } else {
+   next_interval = min(next_interval*2, check_interval*HZ);
}
+
+   schedule_del

[PATCH] x86_64: dynamic MCE poll interval (try 2)

2007-04-27 Thread Tim Hockin

From: Tim Hockin <[EMAIL PROTECTED]>

Background:
We've found that MCEs (specifically DRAM SBEs) tend to come in bunches,
especially when we are trying really hard to stress the system out.  The
current MCE poller uses a static interval which does not care whether it
has or has not found MCEs recently.

Description:
This patch makes the MCE poller adjust the polling interval dynamically.
If we find an MCE, poll 2x faster (down to 10 ms).  When we stop finding
MCEs, poll 2x slower (up to check_interval seconds).  The check_interval
tunable becomes the max polling interval.  The "Machine check events
logged" printk() is rate limited to the check_interval, which should be
identical behavior to the old functionality.

Result:
If you start to take a lot of correctable errors (not exceptions), you
log them faster and more accurately (less chance of overflowing the MCA
registers).  If you don't take a lot of errors, you will see no change.

Alternatives:
I considered simply reducing the polling interval to 10 ms immediately
and keeping it there as long as we continue to find errors.  This felt a
bit heavy handed, but does perform significantly better for the default
check_interval of 5 minutes (we're using a few seconds when testing for
DRAM errors).  I could be convinced to go with this, if anyone felt it
was not too aggressive.

Testing:
I used an error-injecting DIMM to create lots of correctable DRAM errors
and verified that the polling interval accelerates.  The printk() only
happens once per check_interval seconds.

Patch:
This patch is against 2.6.21-rc7.

Signed-Off-By: Tim Hockin <[EMAIL PROTECTED]>

---

This is the fourth version of this patch.  The only change from the prior
version is to not initialize a static var to 0.

This patch should come through cleanly, I hope.


diff -pruN linux-2.6.20/Documentation/x86_64/machinecheck
linux-2.6.20+th/Documentation/x86_64/machinecheck
--- linux-2.6.20/Documentation/x86_64/machinecheck  2007-04-24
23:36:03.0 -0700
+++ linux-2.6.20+th/Documentation/x86_64/machinecheck   2007-04-27
10:11:10.0 -0700
@@ -36,7 +36,12 @@ between all CPUs.

check_interval
How often to poll for corrected machine check errors, in seconds
-   (Note output is hexademical). Default 5 minutes.
+   (Note output is hexademical). Default 5 minutes.  When the poller
+   finds MCEs it triggers an exponential speedup (poll more often) on
+   the polling interval.  When the poller stops finding MCEs, it
+   triggers an exponential backoff (poll less often) on the polling
+   interval. The check_interval variable is both the initial and
+   maximum polling interval.

tolerant
Tolerance level. When a machine check exception occurs for a non
diff -pruN linux-2.6.20/arch/x86_64/kernel/mce.c
linux-2.6.20+th/arch/x86_64/kernel/mce.c
--- linux-2.6.20/arch/x86_64/kernel/mce.c   2007-04-27 10:01:02.0 
-0700
+++ linux-2.6.20+th/arch/x86_64/kernel/mce.c2007-04-27 10:41:02.0 
-0700
@@ -323,10 +323,13 @@ void mce_log_therm_throt_event(unsigned
#endif /* CONFIG_X86_MCE_INTEL */

/*
- * Periodic polling timer for "silent" machine check errors.
+ * Periodic polling timer for "silent" machine check errors.  If the
+ * poller finds an MCE, poll 2x faster.  When the poller finds no more
+ * errors, poll 2x slower (up to check_interval seconds).
 */

static int check_interval = 5 * 60; /* 5 minutes */
+static int next_interval; /* in jiffies */
static void mcheck_timer(struct work_struct *work);
static DECLARE_DELAYED_WORK(mcheck_work, mcheck_timer);

@@ -339,7 +342,6 @@ static void mcheck_check_cpu(void *info)
static void mcheck_timer(struct work_struct *work)
{
on_each_cpu(mcheck_check_cpu, NULL, 1, 1);
-   schedule_delayed_work(&mcheck_work, check_interval * HZ);

/*
 * It's ok to read stale data here for notify_user and
@@ -349,17 +351,30 @@ static void mcheck_timer(struct work_str
 * writes.
 */
if (notify_user && console_logged) {
+   static unsigned long last_print;
+   unsigned long now = jiffies;
+
+   /* if we logged an MCE, reduce the polling interval */
+   next_interval = max(next_interval/2, HZ/100);
notify_user = 0;
clear_bit(0, &console_logged);
-   printk(KERN_INFO "Machine check events logged\n");
+   if (time_after_eq(now, last_print + (check_interval*HZ))) {
+   last_print = now;
+   printk(KERN_INFO "Machine check events logged\n");
+   }
+   } else {
+   next_interval = min(next_interval*2, check_interval*HZ);
}
+
+   schedule_delayed_work(&mcheck_work, next_interval);
}


static __init int periodic_mcheck_init(void)
{
-   if (check_interval)
-   schedule_

[PATCH] x86_64: support poll() on /dev/mcelog

2007-04-29 Thread Tim Hockin
From: Tim Hockin <[EMAIL PROTECTED]>

Background:
 /dev/mcelog is typically polled manually.  This is less than optimal for
 some situations.  Calling poll() on /dev/mcelog does not work.

Description:
 This patch adds support for poll() to /dev/mcelog.  This results in
 immediate wakeup of user apps whenever the poller finds MCEs.  Because
 the exception handler can not take any locks, it can not call the wakeup
 itself.  Instead, it uses a thread_info flag (TIF_MCE_NOTIFY) which is
 caught at the next return from interrupt, calling the mce_user_notify()
 routine.

 This patch also does some small cleanup for essentially unused variables,
 and moves the user notification into the body of the poller, so it is
 only called once, rather than once per CPU.

Result:
 Applications can now poll() on /dev/mcelog.  When an error is logged
 (whether through the poller or through an exception) the applications are
 woken up promptly.  This should not affect any previous behaviors.  If no
 MCEs are being logged, there is no overhead.

Alternatives:
 I considered simply supporting poll() through the poller and not using
 TIF_MCE_NOTIFY at all.  However, the time between an uncorrectable error
 happening and the user application being notified is *the most* critical
 window for us.  Many uncorrectable errors can be logged to the network if
 given a chance.

Testing:
 I used an error-injecting DIMM to create lots of correctable DRAM errors
 and verified that my user app is woken up in sync with the polling interval.
 I also used the northbridge to inject uncorrectable ECC errors, and
 verified (printk() to the rescue) that the notify routine is called and the
 user app does wake up.

Patch:
 This patch is against 2.6.21-rc7.

Signed-off-by: Tim Hockin <[EMAIL PROTECTED]>

---

This is the first public version of this patch.  The TIF_*
approach was suggested by Mike Waychison and Andi did not yell at me when
I suggested it. :)


diff -pruN linux-2.6.20+th/arch/x86_64/kernel/entry.S 
linux-2.6.20+th2v2/arch/x86_64/kernel/entry.S
--- linux-2.6.20+th/arch/x86_64/kernel/entry.S  2007-04-24 22:46:19.0 
-0700
+++ linux-2.6.20+th2v2/arch/x86_64/kernel/entry.S   2007-04-29 
18:10:40.0 -0700
@@ -585,7 +585,7 @@ bad_iret:
 retint_careful:
CFI_RESTORE_STATE
bt$TIF_NEED_RESCHED,%edx
-   jnc   retint_signal
+   jnc   retint_mce
TRACE_IRQS_ON
sti
pushq %rdi
@@ -597,7 +597,23 @@ retint_careful:
cli
TRACE_IRQS_OFF
jmp retint_check
-   
+
+   /* handle a waiting machine check */
+retint_mce:
+   bt $TIF_MCE_NOTIFY,%edx
+   jnc retint_signal
+   TRACE_IRQS_ON
+   sti
+   pushq %rdi
+   CFI_ADJUST_CFA_OFFSET   8
+   call mce_notify_user
+   popq %rdi   
+   CFI_ADJUST_CFA_OFFSET   -8
+   GET_THREAD_INFO(%rcx)
+   cli
+   TRACE_IRQS_OFF
+   jmp retint_check
+
 retint_signal:
testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
jzretint_swapgs
diff -pruN linux-2.6.20+th/arch/x86_64/kernel/mce.c 
linux-2.6.20+th2v2/arch/x86_64/kernel/mce.c
--- linux-2.6.20+th/arch/x86_64/kernel/mce.c2007-04-27 14:19:08.0 
-0700
+++ linux-2.6.20+th2v2/arch/x86_64/kernel/mce.c 2007-04-29 18:27:22.0 
-0700
@@ -20,6 +20,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include  
 #include 
 #include 
@@ -39,8 +41,7 @@ static int mce_dont_init;
 static int tolerant = 1;
 static int banks;
 static unsigned long bank[NR_BANKS] = { [0 ... NR_BANKS-1] = ~0UL };
-static unsigned long console_logged;
-static int notify_user;
+static unsigned long notify_user;
 static int rip_msr;
 static int mce_bootlog = 1;
 static atomic_t mce_events;
@@ -48,6 +49,8 @@ static atomic_t mce_events;
 static char trigger[128];
 static char *trigger_argv[2] = { trigger, NULL };
 
+static DECLARE_WAIT_QUEUE_HEAD(mce_wait);
+
 /*
  * Lockless MCE logging infrastructure.
  * This avoids deadlocks on printk locks without having to break locks. Also
@@ -94,8 +97,7 @@ void mce_log(struct mce *mce)
mcelog.entry[entry].finished = 1;
wmb();
 
-   if (!test_and_set_bit(0, &console_logged))
-   notify_user = 1;
+   set_bit(0, &notify_user);
 }
 
 static void print_mce(struct mce *m)
@@ -167,15 +169,19 @@ static inline void mce_get_rip(struct mc
}
 }
 
-static void do_mce_trigger(void)
+/*
+ * This is only called in normal interrupt context.  This is where we do
+ * anything we need to alert userspace.  This is called directly from the
+ * poller and also from entry.S thanks to TIF_MCE_NOTIFY.
+ */
+void mce_notify_user(void)
 {
-   static atomic_t mce_logged;
-   int events = atomic_read(&mce_events);
-   if (events != atomic_read(&mce_logged) && trigger[0]) {
-   /* Small race window, but should be harmless.  */
-   atomic_set(&mce_logged

Re: [PATCH] x86_64: support poll() on /dev/mcelog

2007-04-29 Thread Tim Hockin

Crap - I have just got soft lockup detected.  Back to debug.

Call Trace:
  [] wake_up_process+0x10/0x20
 [] softlockup_tick+0xea/0x110
 [] run_local_timers+0x13/0x20
 [] update_process_times+0x57/0x90
 [] mcheck_check_cpu+0x0/0x40
 [] smp_local_timer_interrupt+0x34/0x60
 [] smp_apic_timer_interrupt+0x4e/0x70
 [] apic_timer_interrupt+0x66/0x70


On 4/29/07, Tim Hockin <[EMAIL PROTECTED]> wrote:

From: Tim Hockin <[EMAIL PROTECTED]>

Background:
 /dev/mcelog is typically polled manually.  This is less than optimal for
 some situations.  Calling poll() on /dev/mcelog does not work.

Description:
 This patch adds support for poll() to /dev/mcelog.  This results in
 immediate wakeup of user apps whenever the poller finds MCEs.  Because
 the exception handler can not take any locks, it can not call the wakeup
 itself.  Instead, it uses a thread_info flag (TIF_MCE_NOTIFY) which is
 caught at the next return from interrupt, calling the mce_user_notify()
 routine.

 This patch also does some small cleanup for essentially unused variables,
 and moves the user notification into the body of the poller, so it is
 only called once, rather than once per CPU.

Result:
 Applications can now poll() on /dev/mcelog.  When an error is logged
 (whether through the poller or through an exception) the applications are
 woken up promptly.  This should not affect any previous behaviors.  If no
 MCEs are being logged, there is no overhead.

Alternatives:
 I considered simply supporting poll() through the poller and not using
 TIF_MCE_NOTIFY at all.  However, the time between an uncorrectable error
 happening and the user application being notified is *the most* critical
 window for us.  Many uncorrectable errors can be logged to the network if
 given a chance.

Testing:
 I used an error-injecting DIMM to create lots of correctable DRAM errors
 and verified that my user app is woken up in sync with the polling interval.
 I also used the northbridge to inject uncorrectable ECC errors, and
 verified (printk() to the rescue) that the notify routine is called and the
 user app does wake up.

Patch:
 This patch is against 2.6.21-rc7.

Signed-off-by: Tim Hockin <[EMAIL PROTECTED]>

---

This is the first public version of this patch.  The TIF_*
approach was suggested by Mike Waychison and Andi did not yell at me when
I suggested it. :)


diff -pruN linux-2.6.20+th/arch/x86_64/kernel/entry.S 
linux-2.6.20+th2v2/arch/x86_64/kernel/entry.S
--- linux-2.6.20+th/arch/x86_64/kernel/entry.S  2007-04-24 22:46:19.0 
-0700
+++ linux-2.6.20+th2v2/arch/x86_64/kernel/entry.S   2007-04-29 
18:10:40.0 -0700
@@ -585,7 +585,7 @@ bad_iret:
 retint_careful:
CFI_RESTORE_STATE
bt$TIF_NEED_RESCHED,%edx
-   jnc   retint_signal
+   jnc   retint_mce
TRACE_IRQS_ON
sti
pushq %rdi
@@ -597,7 +597,23 @@ retint_careful:
cli
TRACE_IRQS_OFF
jmp retint_check
-
+
+   /* handle a waiting machine check */
+retint_mce:
+   bt $TIF_MCE_NOTIFY,%edx
+   jnc retint_signal
+   TRACE_IRQS_ON
+   sti
+   pushq %rdi
+   CFI_ADJUST_CFA_OFFSET   8
+   call mce_notify_user
+   popq %rdi
+   CFI_ADJUST_CFA_OFFSET   -8
+   GET_THREAD_INFO(%rcx)
+   cli
+   TRACE_IRQS_OFF
+   jmp retint_check
+
 retint_signal:
testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
jzretint_swapgs
diff -pruN linux-2.6.20+th/arch/x86_64/kernel/mce.c 
linux-2.6.20+th2v2/arch/x86_64/kernel/mce.c
--- linux-2.6.20+th/arch/x86_64/kernel/mce.c2007-04-27 14:19:08.0 
-0700
+++ linux-2.6.20+th2v2/arch/x86_64/kernel/mce.c 2007-04-29 18:27:22.0 
-0700
@@ -20,6 +20,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -39,8 +41,7 @@ static int mce_dont_init;
 static int tolerant = 1;
 static int banks;
 static unsigned long bank[NR_BANKS] = { [0 ... NR_BANKS-1] = ~0UL };
-static unsigned long console_logged;
-static int notify_user;
+static unsigned long notify_user;
 static int rip_msr;
 static int mce_bootlog = 1;
 static atomic_t mce_events;
@@ -48,6 +49,8 @@ static atomic_t mce_events;
 static char trigger[128];
 static char *trigger_argv[2] = { trigger, NULL };

+static DECLARE_WAIT_QUEUE_HEAD(mce_wait);
+
 /*
  * Lockless MCE logging infrastructure.
  * This avoids deadlocks on printk locks without having to break locks. Also
@@ -94,8 +97,7 @@ void mce_log(struct mce *mce)
mcelog.entry[entry].finished = 1;
wmb();

-   if (!test_and_set_bit(0, &console_logged))
-   notify_user = 1;
+   set_bit(0, &notify_user);
 }

 static void print_mce(struct mce *m)
@@ -167,15 +169,19 @@ static inline void mce_get_rip(struct mc
}
 }

-static void do_mce_trigger(void)
+/*
+ * This is only called in normal interrupt context.  This is where we do
+ * anything we need to alert userspace.  This is called directly from t

[PATCH] x86_64: support poll() on /dev/mcelog (try #2)

2007-04-30 Thread Tim Hockin
From: Tim Hockin <[EMAIL PROTECTED]>

Background:
 /dev/mcelog is typically polled manually.  This is less than optimal for
 situations where accurate accounting of MCEs is important.  Calling
 poll() on /dev/mcelog does not work.

Description:
 This patch adds support for poll() to /dev/mcelog.  This results in
 immediate wakeup of user apps whenever the poller finds MCEs.  Because
 the exception handler can not take any locks, it can not call the wakeup
 itself.  Instead, it uses a thread_info flag (TIF_MCE_NOTIFY) which is
 caught at the next return from interrupt or exit from idle, calling the
 mce_user_notify() routine.

 This patch also does some small cleanup for essentially unused variables,
 and moves the user notification into the body of the poller, so it is
 only called once per poll, rather than once per CPU.

Result:
 Applications can now poll() on /dev/mcelog.  When an error is logged
 (whether through the poller or through an exception) the applications are
 woken up promptly.  This should not affect any previous behaviors.  If no
 MCEs are being logged, there is no overhead.

Alternatives:
 I considered simply supporting poll() through the poller and not using
 TIF_MCE_NOTIFY at all.  However, the time between an uncorrectable error
 happening and the user application being notified is *the most* critical
 window for us.  Many uncorrectable errors can be logged to the network if
 given a chance.

 I also considered doing the MCE poll directly from the idle notifier, but
 decided that was overkill.

Testing:
 I used an error-injecting DIMM to create lots of correctable DRAM errors
 and verified that my user app is woken up in sync with the polling interval.
 I also used the northbridge to inject uncorrectable ECC errors, and
 verified (printk() to the rescue) that the notify routine is called and the
 user app does wake up.

Caveats:
 I have seen a soft lockup with a call trace always similar to:
Call Trace:
   [] wake_up_process+0x10/0x20
[] softlockup_tick+0xea/0x110
[] run_local_timers+0x13/0x20
[] update_process_times+0x57/0x90
[] mcheck_check_cpu+0x0/0x40
[] smp_local_timer_interrupt+0x34/0x60
[] smp_apic_timer_interrupt+0x4e/0x70
[] apic_timer_interrupt+0x66/0x70

 I regressed this to the vanilla kernel, and it still happens.  It only
 crops up in the face of multiple uncorrectable errors.

Patch:
 This patch is against 2.6.21-rc7.

Signed-off-by: Tim Hockin <[EMAIL PROTECTED]>

---

This is the second version of this patch.  The TIF_* approach was
suggested by Mike Waychison and Andi did not yell at me when I suggested
it.  Hooking the idle notifier was born of an Andrew Morton suggestion
and, no surprise, seems to work well.


diff -pruN linux-2.6.20+th/arch/x86_64/kernel/entry.S 
linux-2.6.20+th2v3/arch/x86_64/kernel/entry.S
--- linux-2.6.20+th/arch/x86_64/kernel/entry.S  2007-04-24 22:46:19.0 
-0700
+++ linux-2.6.20+th2v3/arch/x86_64/kernel/entry.S   2007-04-30 
10:57:43.0 -0700
@@ -282,7 +282,7 @@ sysret_careful:
 sysret_signal:
TRACE_IRQS_ON
sti
-   testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
+   testl 
$(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz1f
 
/* Really a signal */
@@ -375,7 +375,7 @@ int_very_careful:
jmp int_restore_rest

 int_signal:
-   testl $(_TIF_NOTIFY_RESUME|_TIF_SIGPENDING|_TIF_SINGLESTEP),%edx
+   testl 
$(_TIF_NOTIFY_RESUME|_TIF_SIGPENDING|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz 1f
movq %rsp,%rdi  # &ptregs -> arg1
xorl %esi,%esi  # oldset -> arg2
@@ -597,9 +597,9 @@ retint_careful:
cli
TRACE_IRQS_OFF
jmp retint_check
-   
+
 retint_signal:
-   testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
+   testl 
$(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jzretint_swapgs
TRACE_IRQS_ON
sti
diff -pruN linux-2.6.20+th/arch/x86_64/kernel/mce.c 
linux-2.6.20+th2v3/arch/x86_64/kernel/mce.c
--- linux-2.6.20+th/arch/x86_64/kernel/mce.c2007-04-27 14:19:08.0 
-0700
+++ linux-2.6.20+th2v3/arch/x86_64/kernel/mce.c 2007-04-30 22:19:25.0 
-0700
@@ -20,12 +20,15 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include  
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #define MISC_MCELOG_MINOR 227
 #define NR_BANKS 6
@@ -39,8 +42,7 @@ static int mce_dont_init;
 static int tolerant = 1;
 static int banks;
 static unsigned long bank[NR_BANKS] = { [0 ... NR_BANKS-1] = ~0UL };
-static unsigned long console_logged;
-static int notify_user;
+static unsigned long notify_user;
 static int rip_msr;
 static int mce_bootlog = 1;
 static atomic_t mce_events;
@@ -48,6 +50,8 @@ static atomic_t mc

[PATCH] x86_64: O_EXCL on /dev/mcelog

2007-05-01 Thread Tim Hockin
From: Tim Hockin <[EMAIL PROTECTED]>

Background:
 /dev/mcelog is a clear-on-read interface.  It is currently possible for
 multiple users to open and read() the device.  Users are protected from
 each other during any one read, but not across reads.

Description:
 This patch adds support for O_EXCL to /dev/mcelog.  If a user opens the
 device with O_EXCL, no other user may open the device (EBUSY).  Likewise,
 any user that tries to open the device with O_EXCL while another user has
 the device will fail (EBUSY).

Result:
 Applications can get exclusive access to /dev/mcelog.  Applications that
 do not care will be unchanged.

Alternatives:
 A simpler choice would be to only allow one open() at all, regardless of
 O_EXCL.

Testing:
 I wrote an application that opens /dev/mcelog with O_EXCL and observed
 that any other app that tried to open /dev/mcelog would fail until the
 exclusive app had closed the device.

Caveats:
 None.

Patch:
 This patch is against 2.6.21-rc7.

Signed-off-by: Tim Hockin <[EMAIL PROTECTED]>

---

This is the first version of this patch.  The simpler alternative
of only one open() sounds better to me, but becomes a net change in
behavior.


diff -pruN linux-2.6.20+th/arch/x86_64/kernel/mce.c 
linux-2.6.20+th1.5/arch/x86_64/kernel/mce.c
--- linux-2.6.20+th/arch/x86_64/kernel/mce.c2007-04-27 14:19:08.0 
-0700
+++ linux-2.6.20+th1.5/arch/x86_64/kernel/mce.c 2007-05-01 21:53:10.0 
-0700
@@ -465,6 +465,40 @@ void __cpuinit mcheck_init(struct cpuinf
  * Character device to read and clear the MCE log.
  */
 
+static DEFINE_SPINLOCK(mce_state_lock);
+static int open_count; /* #times opened */
+static int open_exclu; /* already open exclusive? */
+
+static int mce_open(struct inode *inode, struct file *file)
+{
+   spin_lock(&mce_state_lock);
+
+   if (open_exclu || (open_count && (file->f_flags & O_EXCL))) {
+   spin_unlock(&mce_state_lock);
+   return -EBUSY;
+   }
+
+   if (file->f_flags & O_EXCL)
+   open_exclu = 1;
+   open_count++;
+
+   spin_unlock(&mce_state_lock);
+
+   return 0;
+}
+
+static int mce_release(struct inode *inode, struct file *file)
+{
+   spin_lock(&mce_state_lock);
+
+   open_count--;
+   open_exclu = 0;
+
+   spin_unlock(&mce_state_lock);
+
+   return 0;
+}
+
 static void collect_tscs(void *data) 
 { 
unsigned long *cpu_tsc = (unsigned long *)data;
@@ -553,6 +587,8 @@ static int mce_ioctl(struct inode *i, st
 }
 
 static const struct file_operations mce_chrdev_ops = {
+   .open = mce_open,
+   .release = mce_release,
.read = mce_read,
.ioctl = mce_ioctl,
 };
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Natsemi DP83815 driver spaming

2007-05-01 Thread Tim Hockin

On 5/1/07, Rafal Bilski <[EMAIL PROTECTED]> wrote:


> 2.6.21.1 is first kernel which I'm using at this device. Earlier it was
> WindowsCE terminal. It is hardware fault. Commenting out the code is my
> way to avoid "wakeup" messages in log, but I don't want to change anything
> in vanilla kernel. I'm lucky that NIC is working at all.


I'm not sure what the right answer is.  The code was designed to do
the right thing, and yet in your case it's broken.  Does it need to be
a build option to work around broken hardware?  Yuck.


Re: Natsemi DP83815 driver spaming

2007-05-02 Thread Tim Hockin

On 5/2/07, Rafal Bilski <[EMAIL PROTECTED]> wrote:


> What about module option?


Looks ok to me, if this is the preferred route.


---
 drivers/net/natsemi.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/drivers/net/natsemi.c b/drivers/net/natsemi.c
index 349b96a..933e106 100644
--- a/drivers/net/natsemi.c
+++ b/drivers/net/natsemi.c
@@ -90,6 +90,10 @@ static int rx_copybreak;
 static int options[MAX_UNITS];
 static int full_duplex[MAX_UNITS];

+/* Used to disable "is PHY alive" check for compatibility with broken
+ * designs using rev. C chips. */
+static int disable_phy_check;
+
 /* Operational parameters that are set at compile time. */

 /* Keep the ring sizes a power of two for compile efficiency.
@@ -141,6 +145,7 @@ module_param(debug, int, 0);
 module_param(rx_copybreak, int, 0);
 module_param_array(options, int, NULL, 0);
 module_param_array(full_duplex, int, NULL, 0);
+module_param(disable_phy_check, int, 0);
 MODULE_PARM_DESC(mtu, "DP8381x MTU (all boards)");
 MODULE_PARM_DESC(debug, "DP8381x default debug level");
 MODULE_PARM_DESC(rx_copybreak,
@@ -148,6 +153,8 @@ MODULE_PARM_DESC(rx_copybreak,
 MODULE_PARM_DESC(options,
"DP8381x: Bits 0-3: media type, bit 17: full duplex");
 MODULE_PARM_DESC(full_duplex, "DP8381x full duplex setting(s) (1)");
+MODULE_PARM_DESC(disable_phy_check, "DP8381[56]: PHY lockup check disable"
+   " (all boards)");

 /*
Theory of Operation
@@ -1753,7 +1760,7 @@ static void netdev_timer(unsigned long data)
writew(1, ioaddr+PGSEL);
dspcfg = readw(ioaddr+DSPCFG);
writew(0, ioaddr+PGSEL);
-   if (dspcfg != np->dspcfg) {
+   if (!disable_phy_check && dspcfg != np->dspcfg) {
if (!netif_queue_stopped(dev)) {
spin_unlock_irq(&np->lock);
if (netif_msg_hw(np))
--



Re: [PATCH] x86_64: support poll() on /dev/mcelog (try #2)

2007-05-02 Thread Tim Hockin

Newer version coming in a while.  Testing.

On 4/30/07, Tim Hockin <[EMAIL PROTECTED]> wrote:

From: Tim Hockin <[EMAIL PROTECTED]>

Background:
 /dev/mcelog is typically polled manually.  This is less than optimal for
 situations where accurate accounting of MCEs is important.  Calling
 poll() on /dev/mcelog does not work.

Description:
 This patch adds support for poll() to /dev/mcelog.  This results in
 immediate wakeup of user apps whenever the poller finds MCEs.  Because
 the exception handler can not take any locks, it can not call the wakeup
 itself.  Instead, it uses a thread_info flag (TIF_MCE_NOTIFY) which is
 caught at the next return from interrupt or exit from idle, calling the
 mce_user_notify() routine.

 This patch also does some small cleanup for essentially unused variables,
 and moves the user notification into the body of the poller, so it is
 only called once per poll, rather than once per CPU.

Result:
 Applications can now poll() on /dev/mcelog.  When an error is logged
 (whether through the poller or through an exception) the applications are
 woken up promptly.  This should not affect any previous behaviors.  If no
 MCEs are being logged, there is no overhead.

Alternatives:
 I considered simply supporting poll() through the poller and not using
 TIF_MCE_NOTIFY at all.  However, the time between an uncorrectable error
 happening and the user application being notified is *the most* critical
 window for us.  Many uncorrectable errors can be logged to the network if
 given a chance.

 I also considered doing the MCE poll directly from the idle notifier, but
 decided that was overkill.

Testing:
 I used an error-injecting DIMM to create lots of correctable DRAM errors
 and verified that my user app is woken up in sync with the polling interval.
 I also used the northbridge to inject uncorrectable ECC errors, and
 verified (printk() to the rescue) that the notify routine is called and the
 user app does wake up.

Caveats:
 I have seen a soft lockup with a call trace always similar to:
Call Trace:
   [] wake_up_process+0x10/0x20
[] softlockup_tick+0xea/0x110
[] run_local_timers+0x13/0x20
[] update_process_times+0x57/0x90
[] mcheck_check_cpu+0x0/0x40
[] smp_local_timer_interrupt+0x34/0x60
[] smp_apic_timer_interrupt+0x4e/0x70
[] apic_timer_interrupt+0x66/0x70

 I regressed this to the vanilla kernel, and it still happens.  It only
 crops up in the face of multiple uncorrectable errors.

Patch:
 This patch is against 2.6.21-rc7.

Signed-off-by: Tim Hockin <[EMAIL PROTECTED]>

---

This is the second version of this patch.  The TIF_* approach was
suggested by Mike Waychison and Andi did not yell at me when I suggested
it.  Hooking the idle notifier was born of an Andrew Morton suggestion
and, no surprise, seems to work well.


diff -pruN linux-2.6.20+th/arch/x86_64/kernel/entry.S 
linux-2.6.20+th2v3/arch/x86_64/kernel/entry.S
--- linux-2.6.20+th/arch/x86_64/kernel/entry.S  2007-04-24 22:46:19.0 
-0700
+++ linux-2.6.20+th2v3/arch/x86_64/kernel/entry.S   2007-04-30 
10:57:43.0 -0700
@@ -282,7 +282,7 @@ sysret_careful:
 sysret_signal:
TRACE_IRQS_ON
sti
-   testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
+   testl 
$(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz1f

/* Really a signal */
@@ -375,7 +375,7 @@ int_very_careful:
jmp int_restore_rest

 int_signal:
-   testl $(_TIF_NOTIFY_RESUME|_TIF_SIGPENDING|_TIF_SINGLESTEP),%edx
+   testl 
$(_TIF_NOTIFY_RESUME|_TIF_SIGPENDING|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz 1f
movq %rsp,%rdi  # &ptregs -> arg1
xorl %esi,%esi  # oldset -> arg2
@@ -597,9 +597,9 @@ retint_careful:
cli
TRACE_IRQS_OFF
jmp retint_check
-
+
 retint_signal:
-   testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
+   testl 
$(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jzretint_swapgs
TRACE_IRQS_ON
sti
diff -pruN linux-2.6.20+th/arch/x86_64/kernel/mce.c 
linux-2.6.20+th2v3/arch/x86_64/kernel/mce.c
--- linux-2.6.20+th/arch/x86_64/kernel/mce.c2007-04-27 14:19:08.0 
-0700
+++ linux-2.6.20+th2v3/arch/x86_64/kernel/mce.c 2007-04-30 22:19:25.0 
-0700
@@ -20,12 +20,15 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 

 #define MISC_MCELOG_MINOR 227
 #define NR_BANKS 6
@@ -39,8 +42,7 @@ static int mce_dont_init;
 static int tolerant = 1;
 static int banks;
 static unsigned long bank[NR_BANKS] = { [0 ... NR_BANKS-1] = ~0UL };
-static unsigned long console_logged;
-static int notify_user;
+static unsigned long notify_user;
 static int rip_msr;
 static 

[PATCH] x86_64: support poll() on /dev/mcelog (try #3)

2007-05-02 Thread Tim Hockin
From: Tim Hockin <[EMAIL PROTECTED]>

Background:
 /dev/mcelog is typically polled manually.  This is less than optimal for
 situations where accurate accounting of MCEs is important.  Calling
 poll() on /dev/mcelog does not work.

Description:
 This patch adds support for poll() to /dev/mcelog.  This results in
 immediate wakeup of user apps whenever the poller finds MCEs.  Because
 the exception handler can not take any locks, it can not call the wakeup
 itself.  Instead, it uses a thread_info flag (TIF_MCE_NOTIFY) which is
 caught at the next return from interrupt or exit from idle, calling the
 mce_user_notify() routine.  This patch also disables the "fake panic"
 path of the mce_panic(), because it results in printk()s in the exception
 handler and crashy systems.

 This patch also does some small cleanup for essentially unused variables,
 and moves the user notification into the body of the poller, so it is
 only called once per poll, rather than once per CPU.

Result:
 Applications can now poll() on /dev/mcelog.  When an error is logged
 (whether through the poller or through an exception) the applications are
 woken up promptly.  This should not affect any previous behaviors.  If no
 MCEs are being logged, there is no overhead.

Alternatives:
 I considered simply supporting poll() through the poller and not using
 TIF_MCE_NOTIFY at all.  However, the time between an uncorrectable error
 happening and the user application being notified is *the*most* critical
 window for us.  Many uncorrectable errors can be logged to the network if
 given a chance.

 I also considered doing the MCE poll directly from the idle notifier, but
 decided that was overkill.

Testing:
 I used an error-injecting DIMM to create lots of correctable DRAM errors
 and verified that my user app is woken up in sync with the polling interval.
 I also used the northbridge to inject uncorrectable ECC errors, and
 verified (printk() to the rescue) that the notify routine is called and the
 user app does wake up.  I built with PREEMPT on and off, and verified
 that my machine survives MCEs.

Patch:
 This patch is against 2.6.21-rc7.

Signed-off-by: Tim Hockin <[EMAIL PROTECTED]>

---

This is the third version of this patch.  The TIF_* approach was
suggested by Mike Waychison and Andi did not yell at me when I suggested
it.  Hooking the idle notifier was born of an Andrew Morton suggestion
and, no surprise, seems to work well.


diff -pruN linux-2.6.20+01_poll_interval/arch/x86_64/kernel/entry.S 
linux-2.6.20+03_poll/arch/x86_64/kernel/entry.S
--- linux-2.6.20+01_poll_interval/arch/x86_64/kernel/entry.S2007-04-24 
22:46:19.0 -0700
+++ linux-2.6.20+03_poll/arch/x86_64/kernel/entry.S 2007-05-02 
20:50:38.0 -0700
@@ -282,7 +282,7 @@ sysret_careful:
 sysret_signal:
TRACE_IRQS_ON
sti
-   testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
+   testl 
$(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz1f
 
/* Really a signal */
@@ -375,7 +375,7 @@ int_very_careful:
jmp int_restore_rest

 int_signal:
-   testl $(_TIF_NOTIFY_RESUME|_TIF_SIGPENDING|_TIF_SINGLESTEP),%edx
+   testl 
$(_TIF_NOTIFY_RESUME|_TIF_SIGPENDING|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz 1f
movq %rsp,%rdi  # &ptregs -> arg1
xorl %esi,%esi  # oldset -> arg2
@@ -599,7 +599,7 @@ retint_careful:
jmp retint_check

 retint_signal:
-   testl $(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP),%edx
+   testl 
$(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jzretint_swapgs
TRACE_IRQS_ON
sti
diff -pruN linux-2.6.20+01_poll_interval/arch/x86_64/kernel/mce.c 
linux-2.6.20+03_poll/arch/x86_64/kernel/mce.c
--- linux-2.6.20+01_poll_interval/arch/x86_64/kernel/mce.c  2007-04-27 
14:19:08.0 -0700
+++ linux-2.6.20+03_poll/arch/x86_64/kernel/mce.c   2007-05-02 
21:02:16.0 -0700
@@ -20,12 +20,15 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include  
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #define MISC_MCELOG_MINOR 227
 #define NR_BANKS 6
@@ -39,8 +42,7 @@ static int mce_dont_init;
 static int tolerant = 1;
 static int banks;
 static unsigned long bank[NR_BANKS] = { [0 ... NR_BANKS-1] = ~0UL };
-static unsigned long console_logged;
-static int notify_user;
+static unsigned long notify_user;
 static int rip_msr;
 static int mce_bootlog = 1;
 static atomic_t mce_events;
@@ -48,6 +50,8 @@ static atomic_t mce_events;
 static char trigger[128];
 static char *trigger_argv[2] = { trigger, NULL };
 
+static DECLARE_WAIT_QUEUE_HEAD(mce_wait);
+
 /*
  * Lockless MCE logging infrastructure.
  * This avoids deadlocks on printk locks without having to break locks. Also
@@ -94,8 +98,7 @@ void mce_log(struct mce *mce)
mcelog.entry[entry].finished 

Re: [PATCH 2/2] net: Implement SO_PEERCGROUP

2014-03-13 Thread Tim Hockin
In some sense a cgroup is a pgrp that mere mortals can't escape.  Why
not just do something like that?  root can set this "container id" or
"job id" on your process when it first starts (e.g. docker sets it on
your container process) or even make a cgroup that sets this for all
processes in that cgroup.

ints are better than strings anyway.

On Thu, Mar 13, 2014 at 10:25 AM, Andy Lutomirski  wrote:
> On Thu, Mar 13, 2014 at 9:33 AM, Simo Sorce  wrote:
>> On Thu, 2014-03-13 at 11:00 -0400, Vivek Goyal wrote:
>>> On Thu, Mar 13, 2014 at 10:55:34AM -0400, Simo Sorce wrote:
>>>
>>> [..]
>>> > > > This might not be quite as awful as I thought.  At least you're
>>> > > > looking up the cgroup at connection time instead of at send time.
>>> > > >
>>> > > > OTOH, this is still racy -- the socket could easily outlive the cgroup
>>> > > > that created it.
>>> > >
>>> > > That's a good point. What guarantees that previous cgroup was not
>>> > > reassigned to a different container.
>>> > >
>>> > > What if a process A opens the connection with sssd. Process A passes the
>>> > > file descriptor to a different process B in a different container.
>>> >
>>> > Stop right here.
>>> > If the process passes the fd it is not my problem anymore.
>>> > The process can as well just 'proxy' all the information to another
>>> > process.
>>> >
>>> > We just care to properly identify the 'original' container, we are not
>>> > in the business of detecting malicious behavior. That's something other
>>> > mechanism need to protect against (SELinux or other LSMs, normal
>>> > permissions, capabilities, etc...).
>>> >
>>> > > Process A exits. Container gets removed from system and new one gets
>>> > > launched which uses same cgroup as old one. Now process B sends a new
>>> > > request and SSSD will serve it based on policy of newly launched
>>> > > container.
>>> > >
>>> > > This sounds very similar to pid race where socket/connection will outlive
>>> > > the pid.
>>> >
>>> > Nope, completely different.
>>> >
>>>
>>> I think you missed my point. Passing file descriptor is not the problem.
>>> Problem is reuse of same cgroup name for a different container while
>>> socket lives on. And it is same race as reuse of a pid for a different
>>> process.
>>
>> The cgroup name should not be reused of course, if userspace does that,
>> it is userspace's issue. cgroup names are not a constrained namespace
>> like pids which force the kernel to reuse them for processes of a
>> different nature.
>>
>
> You're proposing a feature that will enshrine cgroups into the API use
> by non-cgroup-controlling applications.  I don't think that anyone
> thinks that cgroups are pretty, so this is an unfortunate thing to
> have to do.
>
> I've suggested three different ways that your goal could be achieved
> without using cgroups at all.  You haven't really addressed any of
> them.
>
> In order for something like this to go into the kernel, I would expect
> a real use case and a justification for why this is the right way to
> do it.
>
> "Docker containers can be identified by cgroup path" is completely
> unconvincing to me.
>
> --Andy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] net: Implement SO_PEERCGROUP

2014-03-13 Thread Tim Hockin
I don't buy that it is not practical.  Not convenient, maybe.  Not
clean, sure.  But it is practical - it uses mechanisms that exist on
all kernels today.  That is a win, to me.

On Thu, Mar 13, 2014 at 10:58 AM, Simo Sorce  wrote:
> On Thu, 2014-03-13 at 10:55 -0700, Andy Lutomirski wrote:
>>
>> So give each container its own unix socket.  Problem solved, no?
>
> Not really practical if you have hundreds of containers.
>
> Simo.
>


Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-12 Thread Tim Hockin
On Thu, Dec 12, 2013 at 6:21 AM, Tejun Heo  wrote:
> Hey, Tim.
>
> Sidenote: Please don't top-post with the whole body quoted below
> unless you're adding new cc's.  Please selectively quote the original
> message's body to remind the readers of the context and reply below
> it.  It's a basic lkml etiquette and one with good reasons.  If you
> have to top-post for whatever reason - say you're typing from a
> machine which doesn't allow easy editing of the original message,
> explain so at the top of the message, or better yet, wait till you can
> unless it's urgent.

Yeah sorry.  Replying from my phone is awkward at best.  I know better :)

> On Wed, Dec 11, 2013 at 09:37:46PM -0800, Tim Hockin wrote:
>> The immediate problem I see with setting aside reserves "off the top"
>> is that we don't really know a priori how much memory the kernel
>> itself is going to use, which could still land us in an overcommitted
>> state.
>>
>> In other words, if I have your 128 MB machine, and I set aside 8 MB
>> for OOM handling, and give 120 MB for jobs, I have not accounted for
>> the kernel.  So I set aside 8 MB for OOM and 100 MB for jobs, leaving
>> 20 MB for the kernel.  That should be enough right?  Hell if I know, and
>> nothing ensures that.
>
> Yes, sure thing, that's the reason why I mentioned "with some slack"
> in the original message and also that it might not be completely the
> same.  It doesn't allow you to aggressively use system level OOM
> handling as the sizing estimator for the root cgroup; however, it's
> more of an implementation details than something which should guide
> the overall architecture - it's a problem which lessens in severity as
> [k]memcg improves and its coverage becomes more complete, which is the
> direction we should be headed no matter what.

In my mind, the ONLY point of pulling system-OOM handling into
userspace is to make it easier for crazy people (Google) to implement
bizarre system-OOM policies.  Example:

When we have a system OOM we want to do a walk of the administrative
memcg tree (which is only a couple levels deep, users can make
non-admin sub-memcgs), selecting the lowest priority entity at each
step (where both tasks and memcgs have a priority and the priority
range is much wider than the current OOM scores, and where memcg
priority is sometimes a function of memcg usage), until we reach a
leaf.

Once we reach a leaf, I want to log some info about the memcg doing
the allocation, the memcg being terminated, and maybe some other bits
about the system (depending on the priority of the selected victim,
this may or may not be an "acceptable" situation).  Then I want to
kill *everything* under that memcg.  Then I want to "publish" some
information through a sane API (e.g. not dmesg scraping).

This is basically our policy as we understand it today.  This is
notably different than it was a year ago, and it will probably evolve
further in the next year.

Teaching the kernel all of this stuff has proven to be sort of
difficult to maintain and forward-port, and has been very slow to
evolve because of how painful it is to test and deploy new kernels.

Maybe we can find a way to push this level of policy down to the
kernel OOM killer?  When this was mentioned internally I got shot down
(gently, but shot down none the less).  Assuming we had
nearly-reliable (it doesn't have to be 100% guaranteed to be useful)
OOM-in-userspace, I can keep the administrative memcg metadata in
memory, implement killing as cruelly as I need, and do all of the
logging and publication after the OOM kill is done.  Most importantly
I can test and deploy new policy changes pretty trivially.

Handling per-memcg OOM is a different discussion.  Here is where we
want to be able to extract things like heap profiles or take stats
snapshots, grow memcgs (if so configured) etc.  Allowing our users to
have a moment of mercy before we put a bullet in their brain enables a
whole new realm of debugging, as well as a lot of valuable features.

> It'd depend on the workload but with memcg fully configured it
> shouldn't fluctuate wildly.  If it does, we need to hunt down whatever
> is causing such fluctuation and include it in kmemcg, right?  That
> way, memcg as a whole improves for all use cases not just your niche
> one and I strongly believe that aligning as many use cases as possible
> along the same axis, rather than creating a large hole to stow away
> the exceptions, is vastly more beneficial to *everyone* in the long
> term.

We have a long tail of kernel memory usage.  If we provision machines
so that the "do work here" first-level memcg excludes the average
kernel usage, we have a huge number of machines that will fail to
apply OOM 

Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

2013-12-12 Thread Tim Hockin
On Thu, Dec 12, 2013 at 11:23 AM, Tejun Heo  wrote:
> Hello, Tim.
>
> On Thu, Dec 12, 2013 at 10:42:20AM -0800, Tim Hockin wrote:
>> Yeah sorry.  Replying from my phone is awkward at best.  I know better :)
>
> Heh, sorry about being bitchy. :)
>
>> In my mind, the ONLY point of pulling system-OOM handling into
>> userspace is to make it easier for crazy people (Google) to implement
>> bizarre system-OOM policies.  Example:
>
> I think that's one of the places where we largely disagree.  If at all

Just to be clear - I say this because it doesn't feel right to impose
my craziness on others, and it sucks when we try and are met with
"you're crazy, go away".  And you have to admit that happens to
Google. :)  Punching an escape valve that allows us to be crazy
without hurting anyone else sounds ideal, IF and ONLY IF that escape
valve is itself maintainable.

If the escape valve is userspace it's REALLY easy to iterate on our
craziness.  If it is kernel space, it's somewhat less easy, but not
impossible.

> possible, I'd much prefer google's workload to be supported inside the
> general boundaries of the upstream kernel without having to punch a
> large hole in it.  To me, the general development history of memcg in
> general and this thread in particular seem to epitomize why it is a
> bad idea to have isolated, large and deep "crazy" use cases.  Punching
> the initial hole is the easy part; however, we all are quite limited
> in anticipating future needs and sooner or later that crazy use case is
> bound to evolve further towards the isolated extreme it departed
> towards and require more and larger holes and further contortions to
> accommodate such progress.
>
> The concern I have with the suggested solution is not necessarily that
> it's technically more complex than it looks like on the surface - I'm sure
> it can be made to work one way or the other - but that it's a fairly
> large step toward an isolated extreme which memcg as a project
> probably should not head toward.
>
> There sure are cases where such exceptions can't be avoided and are
> good trade-offs but, here, we're talking about a major architectural
> decision which not only affects memcg but mm in general.  I'm afraid
> this doesn't sound like an no-brainer flexibility we can afford.
>
>> When we have a system OOM we want to do a walk of the administrative
>> memcg tree (which is only a couple levels deep, users can make
>> non-admin sub-memcgs), selecting the lowest priority entity at each
>> step (where both tasks and memcgs have a priority and the priority
>> range is much wider than the current OOM scores, and where memcg
>> priority is sometimes a function of memcg usage), until we reach a
>> leaf.
>>
>> Once we reach a leaf, I want to log some info about the memcg doing
>> the allocation, the memcg being terminated, and maybe some other bits
>> about the system (depending on the priority of the selected victim,
>> this may or may not be an "acceptable" situation).  Then I want to
>> kill *everything* under that memcg.  Then I want to "publish" some
>> information through a sane API (e.g. not dmesg scraping).
>>
>> This is basically our policy as we understand it today.  This is
>> notably different than it was a year ago, and it will probably evolve
>> further in the next year.
>
> I think per-memcg score and killing is something which makes
> fundamental sense.  In fact, killing a single process has never made
> much sense to me as that is a unit which ultimately is only meaningful
> to the kernel itself and not necessarily to userland, so no matter
> what I think we're gonna gain per-memcg behavior and it seems most,
> albeit not all, of what you described above should be implementable
> through that.

Well that's an awesome start.  We have or had patches to do a lot of
this.  I don't know how well scrubbed they are for pushing or whether
they apply at all to current head, though.

> Ultimately, if the use case calls for very fine level of control, I
> think the right thing to do is making nesting work properly which is
> likely to take some time.  In the meantime, even if such use case
> requires modifying the kernel to tailor the OOM behavior, I think
> sticking to kernel OOM provides a lot easier way to eventual
> convergence.  Userland system OOM basically means giving up and would
> lessen the motivation towards improving the shared infrastructures
> while adding significant pressure towards schizophrenic diversion.
>
>> We have a long tail of kernel memory usage.  If we provision machines
>> so that the "do work here"
