Re: [PATCH v15 04/13] task_isolation: add initial support

2016-08-30 Thread Peter Zijlstra
On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
> >On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
> >>+   /*
> >>+* Request rescheduling unless we are in full dynticks mode.
> >>+* We would eventually get pre-empted without this, and if
> >>+* there's another task waiting, it would run; but by
> >>+* explicitly requesting the reschedule, we may reduce the
> >>+* latency.  We could directly call schedule() here as well,
> >>+* but since our caller is the standard place where schedule()
> >>+* is called, we defer to the caller.
> >>+*
> >>+* A more substantive approach here would be to use a struct
> >>+* completion here explicitly, and complete it when we shut
> >>+* down dynticks, but since we presumably have nothing better
> >>+* to do on this core anyway, just spinning seems plausible.
> >>+*/
> >>+   if (!tick_nohz_tick_stopped())
> >>+   set_tsk_need_resched(current);
> >This is broken.. and it would be really good if you don't actually need
> >to do this.
> 
> Can you elaborate?  We clearly do want to wait until we are in full
> dynticks mode before we return to userspace.
> 
> We could do it just in the prctl() syscall only, but then we lose the
> ability to implement the NOSIG mode, which can be a convenience.

So this isn't spelled out anywhere. Why does this need to be in the
return to user path?

> Even without that consideration, we really can't be sure we stay in
> dynticks mode if we disable the dynamic tick, but then enable interrupts,
> and end up taking an interrupt on the way back to userspace, and
> it turns the tick back on.  That's why we do it here, where we know
> interrupts will stay disabled until we get to userspace.

But but but.. task_isolation_enter() is explicitly ran with IRQs
_enabled_!! It even WARNs if they're disabled.

> So if we are doing it here, what else can/should we do?  There really
> shouldn't be any other tasks waiting to run at this point, so there's
> not a heck of a lot else to do on this core.  We could just spin and
> check need_resched and signal status manually instead, but that
> seems kind of duplicative of code already done in our caller here.

What !? I really don't get this, what are you waiting for? Why is
rescheduling making things better.


Re: [PATCH v15 04/13] task_isolation: add initial support

2016-08-30 Thread Peter Zijlstra
On Mon, Aug 29, 2016 at 12:53:30PM -0400, Chris Metcalf wrote:

> Would it be cleaner to just replace the set_tsk_need_resched() call
> with something like:
> 
> set_current_state(TASK_INTERRUPTIBLE);
> schedule();
> __set_current_state(TASK_RUNNING);
> 
> or what would you recommend?

That'll just get you to sleep _forever_...

> Or, as I said, just doing a busy loop here while testing to see
> if need_resched or signal had been set?

Why do you care about need_resched() and or signals? How is that related
to the tick being stopped or not?



Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

2016-08-30 Thread Michal Hocko
On Mon 29-08-16 16:37:04, Michal Hocko wrote:
> [Sorry for a late reply, I was busy with other stuff]
> 
> On Mon 22-08-16 15:44:53, Sonny Rao wrote:
> > On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko  wrote:
> [...]
> > But what about the private_clean and private_dirty?  Surely
> > those are more generally useful for calculating a lower bound on
> > process memory usage without additional knowledge?
> 
> I guess private_clean can be used as a reasonable estimate.

I was thinking about this more and I think I am wrong here. Even a truly
MAP_PRIVATE|MAP_ANON mapping will show up in private_dirty, so private_clean
becomes not all that interesting, and it is just as misleading as its _dirty
variant: an mmaped file should become _clean after [m]sync, yet that
doesn't mean the memory will get freed after the process which maps it
terminates. Take shmem as an example again.

> private_dirty less so because it may refer to e.g. tmpfs which is not
> mapped by any other process and so no memory would be freed after unmap
> without removing the file.

-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

2016-08-30 Thread Michal Hocko
On Wed 24-08-16 12:14:06, Marcin Jabrzyk wrote:
[...]
> Sorry to hijack the thread, but I've found it recently
> and I guess it's the best place to present our point.
> We are working on our custom OS based on Linux and we have also suffered
> much from the /proc/<pid>/smaps file. As in Chrome, we tried to improve our
> internal application memory management policies (Low Memory Killer) using
> data provided by smaps, but we failed due to the very long time needed to
> read and properly parse the file.

I was already questioning Pss and also Private_* for any memory-killer
purpose earlier in the thread, because cumulative numbers for all
mappings can be really meaningless, especially when you do not know
which resources are shared and by whom. Maybe you can describe how
you are using those cumulative numbers for your decisions and prove me
wrong, but I simply haven't heard any sound arguments so far. Everything
was just "we know what we are doing in our environment so we know those
resources and therefore those numbers make sense to us". But with all due
respect, this is not a reason to add a user-visible API to the kernel.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH V4] leds: trigger: Introduce an USB port trigger

2016-08-30 Thread Greg KH
On Fri, Aug 26, 2016 at 05:38:05PM +0200, Rafał Miłecki wrote:
> On 25 August 2016 at 14:49, Greg KH  wrote:
> > On Thu, Aug 25, 2016 at 10:03:52AM +0200, Rafał Miłecki wrote:
> >> +static void usbport_trig_activate(struct led_classdev *led_cdev)
> >> +{
> >> + struct usbport_trig_data *usbport_data;
> >> + int err;
> >> +
> >> + usbport_data = kzalloc(sizeof(*usbport_data), GFP_KERNEL);
> >> + if (!usbport_data)
> >> + return;
> >> + usbport_data->led_cdev = led_cdev;
> >> +
> >> + /* Storing ports */
> >> + INIT_LIST_HEAD(&usbport_data->ports);
> >> + usbport_data->ports_dir = kobject_create_and_add("ports",
> >> +  
> >> &led_cdev->dev->kobj);
> >
> > If you _ever_ find yourself in a driver having to use kobject calls,
> > it's a HUGE hint that you are doing something wrong.
> >
> > Hint, you are doing this wrong :)
> >
> > Use an attribute group if you need a subdirectory in sysfs, using a
> > "raw" kobject like this hides it from all userspace tools and so no
> > userspace program can see it (hint, try using libudev to access these
> > files attached to the device...)
> 
> Oops. Thanks for pointing groups to me. I was looking at sysfs.h
> earlier but I didn't realize group can be a subdirectory. I can see
> these sysfs_create_group(s) and friends now, thanks.

No, use an attribute group, and name it, and the subdir will be created
automagically for you.
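
A minimal sketch of the named-group approach being suggested, reusing the
dev_attr_new_port / dev_attr_remove_port attributes from the patch (the
placement shown is illustrative, not the actual driver code):

static struct attribute *ports_attrs[] = {
        &dev_attr_new_port.attr,
        &dev_attr_remove_port.attr,
        NULL,
};

static const struct attribute_group ports_group = {
        .name  = "ports",       /* a named group becomes a "ports/" subdirectory */
        .attrs = ports_attrs,
};

        /* in usbport_trig_activate(), instead of the kobject and
         * device_create_file() calls: */
        err = sysfs_create_group(&led_cdev->dev->kobj, &ports_group);
        if (err)
                goto err_free;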

> >> + if (!usbport_data->ports_dir)
> >> + goto err_free;
> >> +
> >> + /* API for ports management */
> >> + err = device_create_file(led_cdev->dev, &dev_attr_new_port);
> >> + if (err)
> >> + goto err_put_ports;
> >> + err = device_create_file(led_cdev->dev, &dev_attr_remove_port);
> >> + if (err)
> >> + goto err_remove_new_port;
> >
> > Doesn't this race with userspace and fail?  Shouldn't the led core be
> > creating your leds for you?
> 
> These questions aren't clear to me. What kind of race? Doing
> echo usbport > trigger
> results in trigger core calling usbport_trig_activate. We create new
> attributes and then we return.

Why aren't the attributes present when the device is found?  What is
"trigger" for?

> I'm also not creating any leds there. This already has to be LED
> available if you want to assign some trigger to it.

That sounds really odd...

> >> +
> >> + /* Notifications */
> >> + usbport_data->nb.notifier_call = usbport_trig_notify,
> >> + led_cdev->trigger_data = usbport_data;
> >> + usb_register_notify(&usbport_data->nb);
> >
> > Don't abuse the USB notifier chain with stuff like this please, is that
> > really necessary?  Why can't your hardware just tell you what USB ports
> > are in use "out of band"?
> 
> I totally don't understand this part. This LED trigger is supposed to
> react to USB devices appearing at specified ports. How else can I know
> there is a new USB device if not by notifications?
> I'm wondering if we are on the same page. Could it be my idea of this
> LED trigger is not clear at all? Could you take a look at commit
> message again, please? Mostly this part:
> > This commit adds a new trigger responsible for turning on a LED when a USB
> > device gets connected to the specified USB port. This can be useful for
> > various home routers that have USB port(s) and a proper LED telling the user
> > a device is connected.
> 
> Can I add something more to make it clear what this trigger is responsible 
> for?

Yes please, as I totally missed that.

And the USB notifier was created for some "special" parts of the USB
core to use, this feels like an abuse of that in some way.  But it's
hard to define, I know...

> >> +
> >> + led_cdev->activated = true;
> >> + return;
> >> +
> >> +err_remove_new_port:
> >> + device_remove_file(led_cdev->dev, &dev_attr_new_port);
> >> +err_put_ports:
> >> + kobject_put(usbport_data->ports_dir);
> >> +err_free:
> >> + kfree(usbport_data);
> >> +}
> >
> > And again, why is this USB specific?  Why can't you use this same
> > userspace api and codebase for PCI ports?  For a sdcard port?  For a
> > thunderbolt port?
> 
> I'm leaving this one unanswered as discussion on this continued in
> V3.5 thread below my reply:
> On 25 August 2016 at 07:14, Rafał Miłecki  wrote:
> > Good question. I would like to extend this USB port trigger in the
> > future by reacting to USB activity. This involves playing with URBs
> > and I believe that at that point it'd be getting too much USB specific
> > to /rule them all/.

No, I mean why not have this specific activity LED work for all types of
devices, not just USB.

Not that this specific LED works for other types of USB activity.

thanks,

greg k-h


Re: [RFC PATCH v2 04/20] x86: Secure Memory Encryption (SME) support

2016-08-30 Thread Tom Lendacky
On 08/25/2016 08:04 AM, Thomas Gleixner wrote:
> On Mon, 22 Aug 2016, Tom Lendacky wrote:
> 
>> Provide support for Secure Memory Encryption (SME). This initial support
>> defines the memory encryption mask as a variable for quick access and an
>> accessor for retrieving the number of physical addressing bits lost if
>> SME is enabled.
> 
> What is the reason that this needs to live in assembly code?

In later patches this code is expanded and deals with a lot of page
table manipulation, cpuid/rdmsr instructions, etc. and so I thought it
was best to do it this way.

Thanks,
Tom

>  
> Thanks,
> 
>   tglx
> 


Re: [RFC PATCH v2 04/20] x86: Secure Memory Encryption (SME) support

2016-08-30 Thread Andy Lutomirski
On Aug 30, 2016 6:34 AM, "Tom Lendacky"  wrote:
>
> On 08/25/2016 08:04 AM, Thomas Gleixner wrote:
> > On Mon, 22 Aug 2016, Tom Lendacky wrote:
> >
> >> Provide support for Secure Memory Encryption (SME). This initial support
> >> defines the memory encryption mask as a variable for quick access and an
> >> accessor for retrieving the number of physical addressing bits lost if
> >> SME is enabled.
> >
> > What is the reason that this needs to live in assembly code?
>
> In later patches this code is expanded and deals with a lot of page
> table manipulation, cpuid/rdmsr instructions, etc. and so I thought it
> was best to do it this way.

None of that sounds like it needs to be in asm, though.

I, at least, have a strong preference for minimizing the amount of asm
in the low-level arch code.

--Andy


Re: [PATCH v15 04/13] task_isolation: add initial support

2016-08-30 Thread Chris Metcalf

On 8/30/2016 3:58 AM, Peter Zijlstra wrote:

On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:

On 8/29/2016 12:33 PM, Peter Zijlstra wrote:

On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:

+   /*
+* Request rescheduling unless we are in full dynticks mode.
+* We would eventually get pre-empted without this, and if
+* there's another task waiting, it would run; but by
+* explicitly requesting the reschedule, we may reduce the
+* latency.  We could directly call schedule() here as well,
+* but since our caller is the standard place where schedule()
+* is called, we defer to the caller.
+*
+* A more substantive approach here would be to use a struct
+* completion here explicitly, and complete it when we shut
+* down dynticks, but since we presumably have nothing better
+* to do on this core anyway, just spinning seems plausible.
+*/
+   if (!tick_nohz_tick_stopped())
+   set_tsk_need_resched(current);

This is broken.. and it would be really good if you don't actually need
to do this.

Can you elaborate?  We clearly do want to wait until we are in full
dynticks mode before we return to userspace.

We could do it just in the prctl() syscall only, but then we lose the
ability to implement the NOSIG mode, which can be a convenience.

So this isn't spelled out anywhere. Why does this need to be in the
return to user path?


I'm not sure where this should be spelled out, to be honest.  I guess
I can add some commentary to the commit message explaining this part.

The basic idea is just that we don't want to be at risk from the
dyntick getting enabled.  Similarly, we don't want to be at risk of a
later global IPI due to lru_add_drain stuff, for example.  And, we may
want to add additional stuff, like catching kernel TLB flushes and
deferring them when a remote core is in userspace.  To do all of this
kind of stuff, we need to run in the return to user path so we are
late enough to guarantee no further kernel things will happen to
perturb our carefully-arranged isolation state that includes dyntick
off, per-cpu lru cache empty, etc etc.


Even without that consideration, we really can't be sure we stay in
dynticks mode if we disable the dynamic tick, but then enable interrupts,
and end up taking an interrupt on the way back to userspace, and
it turns the tick back on.  That's why we do it here, where we know
interrupts will stay disabled until we get to userspace.

But but but.. task_isolation_enter() is explicitly ran with IRQs
_enabled_!! It even WARNs if they're disabled.


Yes, true!  But if you pop up to the caller, the key thing is the
task_isolation_ready() routine where we are invoked with interrupts
disabled, and we confirm that all our criteria are met (including
tick_nohz_tick_stopped), and then leave interrupts disabled as we
return from there onwards to userspace.

The task_isolation_enter() code just does its best-faith attempt to
make sure all these criteria are met, just like all the other TIF_xxx
flag tests do in exit_to_usermode_loop() on x86, like scheduling,
delivering signals, etc.  As you know, we might run that code, go
around the loop, and discover that the TIF flag has been re-set, and
we have to run the code again before all of that stuff has "quiesced".
The isolation code uses that same model; the only difference is that
we clear the TIF flag manually in the loop by checking
task_isolation_ready().
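
Roughly, the shape being described is the following (an illustration based on
x86's exit_to_usermode_loop(), not the literal patch; _TIF_TASK_ISOLATION and
the task_isolation_* helpers are the hooks under discussion):

        while (true) {
                /* We have work to do: run the handlers with IRQs enabled. */
                local_irq_enable();

                if (cached_flags & _TIF_NEED_RESCHED)
                        schedule();

                if (cached_flags & _TIF_SIGPENDING)
                        do_signal(regs);

                if (cached_flags & _TIF_TASK_ISOLATION)
                        task_isolation_enter();  /* best effort: quiesce vmstat, lru, ... */

                /* Disable IRQs and re-check; only leave once nothing is
                 * pending and the isolation prerequisites all hold. */
                local_irq_disable();
                cached_flags = READ_ONCE(current_thread_info()->flags);

                if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
                    task_isolation_ready())
                        break;
        }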


So if we are doing it here, what else can/should we do?  There really
shouldn't be any other tasks waiting to run at this point, so there's
not a heck of a lot else to do on this core.  We could just spin and
check need_resched and signal status manually instead, but that
seems kind of duplicative of code already done in our caller here.

What !? I really don't get this, what are you waiting for? Why is
rescheduling making things better.


We need to wait for the last dyntick to fire before we can return to
userspace.  There are plenty of options as to what we can do in the
meanwhile.

1. Try to schedule().  Good luck with that in practice, since a
userspace process that has enabled task isolation is going to be alone
on its core unless something pretty broken is happening on the system.
But, at least folks understand the idiom of scheduling out while you wait.

2. Another variant of that: set up a wait completion and have the
dynticks code complete it when the tick turns off.  But this adds
complexity to option 1, and really doesn't buy us much in practice
that I can see.

3. Just admit that we are likely alone on the core, and just burn
cycles in a busy loop waiting for that last tick to fire.  Obviously
if we do this we also need to test for signals and resched so the core
remains responsive.  We can either do this in a loop just by spinning
explicitly, or I could literally just remove the line in the current
patchset that sets TIF_NEED_R

Re: [PATCH v15 04/13] task_isolation: add initial support

2016-08-30 Thread Andy Lutomirski
On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf  wrote:
> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:
>>
>> On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>>>
>>> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:

 On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>
> +   /*
> +* Request rescheduling unless we are in full dynticks mode.
> +* We would eventually get pre-empted without this, and if
> +* there's another task waiting, it would run; but by
> +* explicitly requesting the reschedule, we may reduce the
> +* latency.  We could directly call schedule() here as well,
> +* but since our caller is the standard place where schedule()
> +* is called, we defer to the caller.
> +*
> +* A more substantive approach here would be to use a struct
> +* completion here explicitly, and complete it when we shut
> +* down dynticks, but since we presumably have nothing better
> +* to do on this core anyway, just spinning seems plausible.
> +*/
> +   if (!tick_nohz_tick_stopped())
> +   set_tsk_need_resched(current);

 This is broken.. and it would be really good if you don't actually need
 to do this.
>>>
>>> Can you elaborate?  We clearly do want to wait until we are in full
>>> dynticks mode before we return to userspace.
>>>
>>> We could do it just in the prctl() syscall only, but then we lose the
>>> ability to implement the NOSIG mode, which can be a convenience.
>>
>> So this isn't spelled out anywhere. Why does this need to be in the
>> return to user path?
>
>
> I'm not sure where this should be spelled out, to be honest.  I guess
> I can add some commentary to the commit message explaining this part.
>
> The basic idea is just that we don't want to be at risk from the
> dyntick getting enabled.  Similarly, we don't want to be at risk of a
> later global IPI due to lru_add_drain stuff, for example.  And, we may
> want to add additional stuff, like catching kernel TLB flushes and
> deferring them when a remote core is in userspace.  To do all of this
> kind of stuff, we need to run in the return to user path so we are
> late enough to guarantee no further kernel things will happen to
> perturb our carefully-arranged isolation state that includes dyntick
> off, per-cpu lru cache empty, etc etc.

None of the above should need to *loop*, though, AFAIK.


RE: [PATCH 3/3] drivers: Add visorbus to the drivers/virt directory

2016-08-30 Thread Sell, Timothy C
> -Original Message-
> From: Greg KH [mailto:gre...@linuxfoundation.org]
> Sent: Sunday, August 21, 2016 2:05 PM
> To: Kershner, David A 
> 
> On Fri, Jun 10, 2016 at 11:23:56PM -0400, David Kershner wrote:
> > visorbus is currently located at drivers/staging/visorbus,
> > this patch moves it to drivers/virt.
> >
> > Signed-off-by: David Kershner 
> > Reviewed-by: Tim Sell 
> 
> I picked one random file here, this last one, and found a number of
> "odd" things in it.
> 
> So, given that I can't really comment on the patch itself, I'm going to
> include the file below, quote it, and then provide some comments, ok?

There's obviously a lot of work we have to do here (and that work is
underway), but I'd like to focus this email on explaining a class of comments
you made about our error-handling.

You have correctly pointed out some areas where we are deficient in our
error-handling, but I think there are others where our error-handling is
indeed sufficient, but this is perhaps NOT obvious because we are sending
error codes to our partitioning firmware environment (via a message-passing
interface across a shared-memory interface known as the 'controlvm channel')
instead of specifying them in function return values.  This partitioning
firmware environment is NOT Linux, and is external to the Linux OS environment
where the visorbus driver is running.

It helps to understand the overall flow of visorchipset.c, which basically
fills the role of the server responsible for handling requests initiated by
the partitioning firmware environment:

* controlvm_periodic_work() / handle_command() reads a request from the
  partitioning firmware environment, which will be something like "create a
  virtual bus" or "create a virtual device".

* A case statement in handle_command() calls one of various functions to handle
  the request, which sends the success/failure result code of the operation
  back to the partitioning firmware environment using one of the paths:
  * bus_epilog() / bus_responder() / controlvm_respond()
  * device_epilog() / device_responder() / controlvm_respond()
  This success/failure result code is indicated via one of those
  CONTROLVM_RESP_* error mnemonics (which you clearly indicated your hatred
  for, and which we will change).

* The partitioning firmware environment will react accordingly to the error
  code returned.  E.g., if the request was to create the virtual bus containing
  the boot device, and the error code was CONTROLVM_RESP_ERROR_KMALLOC_FAILED,
  the partitioning firmware environment would fail the boot of the Linux guest
  and inform the user that the Linux guest boot failed due to an out-of-memory
  condition.
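
Condensed, the dispatch described above looks roughly like this (a simplified
illustration loosely based on the staging driver's visorchipset.c, not the
literal source):

static bool handle_command(struct controlvm_message inmsg, u64 channel_addr)
{
        switch (inmsg.hdr.id) {
        case CONTROLVM_BUS_CREATE:
                bus_create(&inmsg);       /* replies via bus_epilog()/bus_responder() */
                break;
        case CONTROLVM_DEVICE_CREATE:
                my_device_create(&inmsg); /* replies via device_epilog()/device_responder() */
                break;
        default:
                /* unknown requests are answered with a CONTROLVM_RESP_* error */
                break;
        }
        return true;
}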

(This also should clarify why we use awkward mnemonics like 
CONTROLVM_RESP_ERROR_KMALLOC_FAILED instead of -ENOMEM -- it's because the
error codes are not defined by a Linux environment, because it isn't a Linux
environment that is actually making the controlvm requests.  But that's another
issue.)

Let me use visorchipset.c bus_create() as an example:

static void
bus_create(struct controlvm_message *inmsg)
{
        struct controlvm_message_packet *cmd = &inmsg->cmd;
        u32 bus_no = cmd->create_bus.bus_no;
        int rc = CONTROLVM_RESP_SUCCESS;
        struct visor_device *bus_info;
        struct visorchannel *visorchannel;

        bus_info = visorbus_get_device_by_id(bus_no, BUS_ROOT_DEVICE, NULL);
        if (bus_info && (bus_info->state.created == 1)) {
                POSTCODE_LINUX_3(BUS_CREATE_FAILURE_PC, bus_no,
                                 POSTCODE_SEVERITY_ERR);
                rc = -CONTROLVM_RESP_ERROR_ALREADY_DONE;
                goto out_bus_epilog;
        }
        bus_info = kzalloc(sizeof(*bus_info), GFP_KERNEL);
        if (!bus_info) {
                POSTCODE_LINUX_3(BUS_CREATE_FAILURE_PC, bus_no,
                                 POSTCODE_SEVERITY_ERR);
                rc = -CONTROLVM_RESP_ERROR_KMALLOC_FAILED;
                goto out_bus_epilog;
        }

        INIT_LIST_HEAD(&bus_info->list_all);
        bus_info->chipset_bus_no = bus_no;
        bus_info->chipset_dev_no = BUS_ROOT_DEVICE;

        POSTCODE_LINUX_3(BUS_CREATE_ENTRY_PC, bus_no, POSTCODE_SEVERITY_INFO);

        visorchannel = visorchannel_create(cmd->create_bus.channel_addr,
                                           cmd->create_bus.channel_bytes,
                                           GFP_KERNEL,
                                           cmd->create_bus.bus_data_type_uuid);

        if (!visorchannel) {
                POSTCODE_LINUX_3(BUS_CREATE_FAILURE_PC, bus_no,
                                 POSTCODE_SEVERITY_ERR);
                rc = -CONTROLVM_RESP_ERROR_KMALLOC_FAILED;
                kfree(bus_info);
                bus_info = NULL;
                goto out_bus_epilog;
        }
        bus_info->visorchannel = visorchannel;
        if (uuid_le_cmp(cmd->create_bus.bus_inst_uuid, spar_siovm_uuid) == 0) {

Re: [PATCH v14 04/14] task_isolation: add initial support

2016-08-30 Thread Frederic Weisbecker
On Tue, Aug 30, 2016 at 11:41:36AM -0400, Chris Metcalf wrote:
> On 8/29/2016 8:55 PM, Frederic Weisbecker wrote:
> >On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote:
> >>On 8/11/2016 2:50 PM, Christoph Lameter wrote:
> >>>On Thu, 11 Aug 2016, Frederic Weisbecker wrote:
> >>>
> >>>>Do we need to quiesce vmstat every time before entering userspace?
> >>>>I thought that vmstat only needs to be offlined once and for all?
> >>>Once is sufficient after disabling the tick.
> >>It's true that task_isolation_enter() is called every time before
> >>returning to user space while task isolation is enabled.
> >>
> >>But once we enter the kernel again after returning from the initial
> >>prctl() -- assuming we are in NOSIG mode so doing so is legal in the
> >>first place -- almost anything can happen, certainly including
> >>restarting the tick.  Thus, we have to make sure that normal quiescing
> >>happens again before we return to userspace.
> >Yes but we need to sort out what needs to be called only once on prctl().
> >
> >Once vmstat is quiesced, it's not going to need quiescing again even if we
> >restart the tick.
> 
> That's true, but I really do like the idea of having a clean structure
> where we verify all our prerequisites in task_isolation_ready(), and
> have code to try to get things fixed up in task_isolation_enter().
> If we start moving some things here and some things there, it gets
> harder to manage.  I think by testing "!vmstat_idle()" in
> task_isolation_enter() we are avoiding any substantial overhead.

I think that making the code clearer about what needs to be done once and
for all on prctl() and what needs to be done on every actual syscall return
is more important for readability.

> 
> I think it would be clearer to rename task_isolation_enter()
> to task_isolation_prepare(); it might be less confusing.

For the prctl part, why not.

> 
> Remember too that in general, we really don't need to think about
> return-to-userspace as a hot path for task isolation, unlike how we
> think about it all the rest of the time.  So it makes sense to
> prioritize keeping things clean from a software development
> perspective first, and high-performance only secondarily.
> 
> >>The thing to remember is that this is only relevant if the user has
> >>explicitly requested the NOSIG behavior from task isolation, which we
> >>don't really expect to be the default - we are implicitly encouraging
> >>use of the default semantics of "you can't enter the kernel again
> >>until you turn off isolation".
> >That's right. Although NOSIG is the only thing we can afford as long as
> >we drag around the 1Hz.
> 
> True enough.  Hopefully we'll finish sorting that out soon enough.
> 
> + if (!tick_nohz_tick_stopped())
> + set_tsk_need_resched(current);
> Again, that won't help
> >>It won't be better than spinning in a loop if there aren't any other
> >>schedulable processes, but it won't be worse either.  If there is
> >>another schedulable process, we at least will schedule it sooner than
> >>if we just sat in a busy loop and waited for the scheduler to kick
> >>us. But there's nothing else we can do anyway if we want to maintain
> >>the guarantee that the dyn tick is stopped before return to userspace.
> >I don't think it helps either way. If reschedule is pending, the current
> >task already has TIF_RESCHED set.
> 
> See the other thread with Peter Z for the longer discussion of this.
> At this point I'm leaning towards replacing the set_tsk_need_resched() with
> 
> set_current_state(TASK_INTERRUPTIBLE);
> schedule();
> __set_current_state(TASK_RUNNING);

I don't see how that helps. What will wake the thread up except a signal?


Re: [PATCH v14 04/14] task_isolation: add initial support

2016-08-30 Thread Chris Metcalf

On 8/29/2016 8:55 PM, Frederic Weisbecker wrote:

On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote:

On 8/11/2016 2:50 PM, Christoph Lameter wrote:

On Thu, 11 Aug 2016, Frederic Weisbecker wrote:


Do we need to quiesce vmstat every time before entering userspace?
I thought that vmstat only needs to be offlined once and for all?

Once is sufficient after disabling the tick.

It's true that task_isolation_enter() is called every time before
returning to user space while task isolation is enabled.

But once we enter the kernel again after returning from the initial
prctl() -- assuming we are in NOSIG mode so doing so is legal in the
first place -- almost anything can happen, certainly including
restarting the tick.  Thus, we have to make sure that normal quiescing
happens again before we return to userspace.

Yes but we need to sort out what needs to be called only once on prctl().

Once vmstat is quiesced, it's not going to need quiescing again even if we
restart the tick.


That's true, but I really do like the idea of having a clean structure
where we verify all our prerequisites in task_isolation_ready(), and
have code to try to get things fixed up in task_isolation_enter().
If we start moving some things here and some things there, it gets
harder to manage.  I think by testing "!vmstat_idle()" in
task_isolation_enter() we are avoiding any substantial overhead.
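
The split being described looks roughly like this (a sketch; vmstat_idle() is
the helper named in the discussion, the rest is approximated rather than
copied from the patch):

        /* Called from the exit loop with IRQs enabled: best-effort fix-ups. */
        void task_isolation_enter(void)
        {
                if (!vmstat_idle())
                        quiet_vmstat();         /* fold this cpu's vm counter deltas now */
                lru_add_drain();                /* drain this cpu's lru pagevecs */
        }

        /* Checked with IRQs disabled just before returning to userspace. */
        bool task_isolation_ready(void)
        {
                return tick_nohz_tick_stopped() && vmstat_idle();
        }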

I think it would be clearer to rename task_isolation_enter()
to task_isolation_prepare(); it might be less confusing.

Remember too that in general, we really don't need to think about
return-to-userspace as a hot path for task isolation, unlike how we
think about it all the rest of the time.  So it makes sense to
prioritize keeping things clean from a software development
perspective first, and high-performance only secondarily.


The thing to remember is that this is only relevant if the user has
explicitly requested the NOSIG behavior from task isolation, which we
don't really expect to be the default - we are implicitly encouraging
use of the default semantics of "you can't enter the kernel again
until you turn off isolation".

That's right. Although NOSIG is the only thing we can afford as long as
we drag around the 1Hz.


True enough.  Hopefully we'll finish sorting that out soon enough.


+   if (!tick_nohz_tick_stopped())
+   set_tsk_need_resched(current);
Again, that won't help

It won't be better than spinning in a loop if there aren't any other
schedulable processes, but it won't be worse either.  If there is
another schedulable process, we at least will schedule it sooner than
if we just sat in a busy loop and waited for the scheduler to kick
us. But there's nothing else we can do anyway if we want to maintain
the guarantee that the dyn tick is stopped before return to userspace.

I don't think it helps either way. If reschedule is pending, the current
task already has TIF_RESCHED set.


See the other thread with Peter Z for the longer discussion of this.
At this point I'm leaning towards replacing the set_tsk_need_resched() with

set_current_state(TASK_INTERRUPTIBLE);
schedule();
__set_current_state(TASK_RUNNING);

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v15 04/13] task_isolation: add initial support

2016-08-30 Thread Chris Metcalf

On 8/30/2016 12:30 PM, Andy Lutomirski wrote:

On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf  wrote:

On 8/30/2016 3:58 AM, Peter Zijlstra wrote:

On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:

On 8/29/2016 12:33 PM, Peter Zijlstra wrote:

On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:

+   /*
+* Request rescheduling unless we are in full dynticks mode.
+* We would eventually get pre-empted without this, and if
+* there's another task waiting, it would run; but by
+* explicitly requesting the reschedule, we may reduce the
+* latency.  We could directly call schedule() here as well,
+* but since our caller is the standard place where schedule()
+* is called, we defer to the caller.
+*
+* A more substantive approach here would be to use a struct
+* completion here explicitly, and complete it when we shut
+* down dynticks, but since we presumably have nothing better
+* to do on this core anyway, just spinning seems plausible.
+*/
+   if (!tick_nohz_tick_stopped())
+   set_tsk_need_resched(current);

This is broken.. and it would be really good if you don't actually need
to do this.

Can you elaborate?  We clearly do want to wait until we are in full
dynticks mode before we return to userspace.

We could do it just in the prctl() syscall only, but then we lose the
ability to implement the NOSIG mode, which can be a convenience.

So this isn't spelled out anywhere. Why does this need to be in the
return to user path?


I'm not sure where this should be spelled out, to be honest.  I guess
I can add some commentary to the commit message explaining this part.

The basic idea is just that we don't want to be at risk from the
dyntick getting enabled.  Similarly, we don't want to be at risk of a
later global IPI due to lru_add_drain stuff, for example.  And, we may
want to add additional stuff, like catching kernel TLB flushes and
deferring them when a remote core is in userspace.  To do all of this
kind of stuff, we need to run in the return to user path so we are
late enough to guarantee no further kernel things will happen to
perturb our carefully-arranged isolation state that includes dyntick
off, per-cpu lru cache empty, etc etc.

None of the above should need to *loop*, though, AFAIK.


Ordering is a problem, though.

We really want to run task isolation last, so we can guarantee that
all the isolation prerequisites are met (dynticks stopped, per-cpu lru
cache empty, etc).  But achieving that state can require enabling
interrupts - most obviously if we have to schedule, e.g. for vmstat
clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
just while waiting for that last dyntick interrupt to occur.  I'm also
not sure that even something as simple as draining the per-cpu lru
cache can be done holding interrupts disabled throughout - certainly
there's a !SMP code path there that just re-enables interrupts
unconditionally, which gives me pause.

At any rate at that point you need to retest for signals, resched,
etc, all as usual, and then you need to recheck the task isolation
prerequisites once more.

I may be missing something here, but it's really not obvious to me
that there's a way to do this without having task isolation integrated
into the usual return-to-userspace loop.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v15 04/13] task_isolation: add initial support

2016-08-30 Thread Andy Lutomirski
On Aug 30, 2016 10:02 AM, "Chris Metcalf"  wrote:
>
> On 8/30/2016 12:30 PM, Andy Lutomirski wrote:
>>
>> On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf  wrote:
>>>
>>> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:

 On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>
> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
>>
>> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>>>
>>> +   /*
>>> +* Request rescheduling unless we are in full dynticks mode.
>>> +* We would eventually get pre-empted without this, and if
>>> +* there's another task waiting, it would run; but by
>>> +* explicitly requesting the reschedule, we may reduce the
>>> +* latency.  We could directly call schedule() here as well,
>>> +* but since our caller is the standard place where schedule()
>>> +* is called, we defer to the caller.
>>> +*
>>> +* A more substantive approach here would be to use a struct
>>> +* completion here explicitly, and complete it when we shut
>>> +* down dynticks, but since we presumably have nothing better
>>> +* to do on this core anyway, just spinning seems plausible.
>>> +*/
>>> +   if (!tick_nohz_tick_stopped())
>>> +   set_tsk_need_resched(current);
>>
>> This is broken.. and it would be really good if you don't actually need
>> to do this.
>
> Can you elaborate?  We clearly do want to wait until we are in full
> dynticks mode before we return to userspace.
>
> We could do it just in the prctl() syscall only, but then we lose the
> ability to implement the NOSIG mode, which can be a convenience.

 So this isn't spelled out anywhere. Why does this need to be in the
 return to user path?
>>>
>>>
>>> I'm not sure where this should be spelled out, to be honest.  I guess
>>> I can add some commentary to the commit message explaining this part.
>>>
>>> The basic idea is just that we don't want to be at risk from the
>>> dyntick getting enabled.  Similarly, we don't want to be at risk of a
>>> later global IPI due to lru_add_drain stuff, for example.  And, we may
>>> want to add additional stuff, like catching kernel TLB flushes and
>>> deferring them when a remote core is in userspace.  To do all of this
>>> kind of stuff, we need to run in the return to user path so we are
>>> late enough to guarantee no further kernel things will happen to
>>> perturb our carefully-arranged isolation state that includes dyntick
>>> off, per-cpu lru cache empty, etc etc.
>>
>> None of the above should need to *loop*, though, AFAIK.
>
>
> Ordering is a problem, though.
>
> We really want to run task isolation last, so we can guarantee that
> all the isolation prerequisites are met (dynticks stopped, per-cpu lru
> cache empty, etc).  But achieving that state can require enabling
> interrupts - most obviously if we have to schedule, e.g. for vmstat
> clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
> just while waiting for that last dyntick interrupt to occur.  I'm also
> not sure that even something as simple as draining the per-cpu lru
> cache can be done holding interrupts disabled throughout - certainly
> there's a !SMP code path there that just re-enables interrupts
> unconditionally, which gives me pause.
>
> At any rate at that point you need to retest for signals, resched,
> etc, all as usual, and then you need to recheck the task isolation
> prerequisites once more.
>
> I may be missing something here, but it's really not obvious to me
> that there's a way to do this without having task isolation integrated
> into the usual return-to-userspace loop.
>

What if we did it the other way around: set a percpu flag saying
"going quiescent; disallow new deferred work", then finish all
existing work and return to userspace.  Then, on the next entry, clear
that flag.  With the flag set, vmstat would just flush anything that
it accumulates immediately, nothing would be added to the LRU list,
etc.
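
A minimal sketch of that idea (every name here is hypothetical, purely to
illustrate the flag):

        static DEFINE_PER_CPU(bool, isolation_quiescing);

        /* Set on the return-to-user path of an isolated task, cleared on the
         * next kernel entry; producers of deferrable work consult it. */
        static inline bool isolation_no_deferred_work(void)
        {
                return this_cpu_read(isolation_quiescing);
        }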

Also, this cond_resched stuff doesn't worry me too much at a
fundamental level -- if we're really going quiescent, shouldn't we be
able to arrange that there are no other schedulable tasks on the CPU
in question?

--Andy


Re: [PATCH v14 04/14] task_isolation: add initial support

2016-08-30 Thread Chris Metcalf

On 8/30/2016 1:36 PM, Chris Metcalf wrote:

See the other thread with Peter Z for the longer discussion of this.
At this point I'm leaning towards replacing the set_tsk_need_resched() with

 set_current_state(TASK_INTERRUPTIBLE);
 schedule();
 __set_current_state(TASK_RUNNING);

I don't see how that helps. What will wake the thread up except a signal?


The answer is that the scheduler will keep bringing us back to this
point (either after running another runnable task if there is one,
or else just returning to this point immediately without doing a
context switch), and we will then go around the "prepare exit to
userspace" loop and perhaps discover that enough time has gone
by that the last dyntick interrupt has triggered and the kernel has
quiesced the dynticks.  At that point we stop calling schedule()
over and over and can return normally to userspace. 


Oops, you're right that if I set TASK_INTERRUPTIBLE, then if I call
schedule(), I never get control back.  So I don't want to do that.

I suppose I could do a schedule_timeout() here instead and try
to figure out how long to wait for the next dyntick.  But since we
don't expect anything else running on the core anyway, it seems
like we are probably working too hard at this point.  I don't think
it's worth it just to go into the idle task and (possibly) save some
power for a few microseconds.

The more I think about this, the more I think I am micro-optimizing
by trying to poke the scheduler prior to some external thing setting
need_resched, so I think the thing to do here is in fact, nothing.
I won't worry about rescheduling but will just continue going around
the prepare-exit-to-userspace loop until the last dyn tick fires.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v15 04/13] task_isolation: add initial support

2016-08-30 Thread Andy Lutomirski
On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf  wrote:
> On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
>>
>> On Aug 30, 2016 10:02 AM, "Chris Metcalf"  wrote:
>>>
>>> On 8/30/2016 12:30 PM, Andy Lutomirski wrote:

 On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf 
 wrote:
>
> The basic idea is just that we don't want to be at risk from the
> dyntick getting enabled.  Similarly, we don't want to be at risk of a
> later global IPI due to lru_add_drain stuff, for example.  And, we may
> want to add additional stuff, like catching kernel TLB flushes and
> deferring them when a remote core is in userspace.  To do all of this
> kind of stuff, we need to run in the return to user path so we are
> late enough to guarantee no further kernel things will happen to
> perturb our carefully-arranged isolation state that includes dyntick
> off, per-cpu lru cache empty, etc etc.

 None of the above should need to *loop*, though, AFAIK.
>>>
>>> Ordering is a problem, though.
>>>
>>> We really want to run task isolation last, so we can guarantee that
>>> all the isolation prerequisites are met (dynticks stopped, per-cpu lru
>>> cache empty, etc).  But achieving that state can require enabling
>>> interrupts - most obviously if we have to schedule, e.g. for vmstat
>>> clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
>>> just while waiting for that last dyntick interrupt to occur.  I'm also
>>> not sure that even something as simple as draining the per-cpu lru
>>> cache can be done holding interrupts disabled throughout - certainly
>>> there's a !SMP code path there that just re-enables interrupts
>>> unconditionally, which gives me pause.
>>>
>>> At any rate at that point you need to retest for signals, resched,
>>> etc, all as usual, and then you need to recheck the task isolation
>>> prerequisites once more.
>>>
>>> I may be missing something here, but it's really not obvious to me
>>> that there's a way to do this without having task isolation integrated
>>> into the usual return-to-userspace loop.
>>>
>> What if we did it the other way around: set a percpu flag saying
>> "going quiescent; disallow new deferred work", then finish all
>> existing work and return to userspace.  Then, on the next entry, clear
>> that flag.  With the flag set, vmstat would just flush anything that
>> it accumulates immediately, nothing would be added to the LRU list,
>> etc.
>
>
> This is an interesting idea!
>
> However, there are a number of implementation ideas that make me
> worry that it might be a trickier approach overall.
>
> First, "on the next entry" hides a world of hurt in four simple words.
> Some platforms (arm64 and tile, that I'm familiar with) have a common
> chunk of code that always runs on every entry to the kernel.  It would
> not be too hard to poke at the assembly and make those platforms
> always run some task-isolation specific code on entry.  But x86 scares
> me - there seem to be a whole lot of ways to get into the kernel, and
> I'm not convinced there is a lot of shared macrology or whatever that
> would make it straightforward to intercept all of them.

Just use the context tracking entry hook.  It's 100% reliable.  The
relevant x86 function is enter_from_user_mode(), but I would just hook
into user_exit() in the common code.  (This code had better be
reliable, because context tracking depends on it, and, if context
tracking doesn't work on a given arch, then isolation isn't going to
work regardless.)
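
A sketch of that hook point, based on the 2016-era context tracking code (the
task_isolation_kernel_entry() call is hypothetical):

        static inline void user_exit(void)
        {
                if (context_tracking_is_enabled())
                        context_tracking_exit(CONTEXT_USER);    /* existing behaviour */

                /* hypothetical: clear the per-cpu "quiescent" flag on kernel entry */
                task_isolation_kernel_entry();
        }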

>
> Then, there are the two actual subsystems in question.  It looks like
> we could intercept LRU reasonably cleanly by hooking pagevec_add()
> to return zero when we are in this "going quiescent" mode, and that
> would keep the per-cpu vectors empty.  The vmstat stuff is a little
> trickier since all the existing code is built around updating the per-cpu
> stuff and then only later copying it off to the global state.  I suppose
> we could add a test-and-flush at the end of every public API and not
> worry about the implementation cost.

Seems reasonable to me.  If anyone cares about the performance hit,
they can fix it.

>
> But it does seem like we are adding noticeable maintenance cost on
> the mainline kernel to support task isolation by doing this.  My guess
> is that it is easier to support the kind of "are you clean?" / "get clean"
> APIs for subsystems, rather than weaving a whole set of "stay clean"
> mechanism into each subsystem.

My intuition is that it's the other way around.  For the mainline
kernel, having a nice clean well-integrated implementation is nicer
than having a bolted-on implementation that interacts in potentially
complicated ways.  Once quiescence support is in mainline, the size of
the diff or the degree to which it's scattered around is irrelevant
because it's not a diff any more.

>
> So to pop up a level, what is your actual concern about the existing
> "do it in a loop" model?  The macrology current

Re: [PATCH v15 04/13] task_isolation: add initial support

2016-08-30 Thread Chris Metcalf

On 8/30/2016 2:43 PM, Andy Lutomirski wrote:

On Aug 30, 2016 10:02 AM, "Chris Metcalf"  wrote:

On 8/30/2016 12:30 PM, Andy Lutomirski wrote:

On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf  wrote:

The basic idea is just that we don't want to be at risk from the
dyntick getting enabled.  Similarly, we don't want to be at risk of a
later global IPI due to lru_add_drain stuff, for example.  And, we may
want to add additional stuff, like catching kernel TLB flushes and
deferring them when a remote core is in userspace.  To do all of this
kind of stuff, we need to run in the return to user path so we are
late enough to guarantee no further kernel things will happen to
perturb our carefully-arranged isolation state that includes dyntick
off, per-cpu lru cache empty, etc etc.

None of the above should need to *loop*, though, AFAIK.

Ordering is a problem, though.

We really want to run task isolation last, so we can guarantee that
all the isolation prerequisites are met (dynticks stopped, per-cpu lru
cache empty, etc).  But achieving that state can require enabling
interrupts - most obviously if we have to schedule, e.g. for vmstat
clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
just while waiting for that last dyntick interrupt to occur.  I'm also
not sure that even something as simple as draining the per-cpu lru
cache can be done holding interrupts disabled throughout - certainly
there's a !SMP code path there that just re-enables interrupts
unconditionally, which gives me pause.

At any rate at that point you need to retest for signals, resched,
etc, all as usual, and then you need to recheck the task isolation
prerequisites once more.

I may be missing something here, but it's really not obvious to me
that there's a way to do this without having task isolation integrated
into the usual return-to-userspace loop.


What if we did it the other way around: set a percpu flag saying
"going quiescent; disallow new deferred work", then finish all
existing work and return to userspace.  Then, on the next entry, clear
that flag.  With the flag set, vmstat would just flush anything that
it accumulates immediately, nothing would be added to the LRU list,
etc.


This is an interesting idea!

However, there are a number of implementation ideas that make me
worry that it might be a trickier approach overall.

First, "on the next entry" hides a world of hurt in four simple words.
Some platforms (arm64 and tile, that I'm familiar with) have a common
chunk of code that always runs on every entry to the kernel.  It would
not be too hard to poke at the assembly and make those platforms
always run some task-isolation specific code on entry.  But x86 scares
me - there seem to be a whole lot of ways to get into the kernel, and
I'm not convinced there is a lot of shared macrology or whatever that
would make it straightforward to intercept all of them.

Then, there are the two actual subsystems in question.  It looks like
we could intercept LRU reasonably cleanly by hooking pagevec_add()
to return zero when we are in this "going quiescent" mode, and that
would keep the per-cpu vectors empty.  The vmstat stuff is a little
trickier since all the existing code is built around updating the per-cpu
stuff and then only later copying it off to the global state.  I suppose
we could add a test-and-flush at the end of every public API and not
worry about the implementation cost.
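
As an illustration of the pagevec half of that (2016-era pagevec API; the
task_isolation_going_quiescent() predicate is hypothetical), returning zero
makes callers drain the per-cpu vector immediately instead of batching:

        static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
        {
                pvec->pages[pvec->nr++] = page;

                if (task_isolation_going_quiescent())
                        return 0;       /* pretend the pagevec is full so it is drained now */

                return pagevec_space(pvec);
        }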

But it does seem like we are adding noticeable maintenance cost on
the mainline kernel to support task isolation by doing this.  My guess
is that it is easier to support the kind of "are you clean?" / "get clean"
APIs for subsystems, rather than weaving a whole set of "stay clean"
mechanism into each subsystem.

So to pop up a level, what is your actual concern about the existing
"do it in a loop" model?  The macrology currently in use means there
is zero cost if you don't configure TASK_ISOLATION, and the software
maintenance cost seems low since the idioms used for task isolation
in the loop are generally familiar to people reading that code.


Also, this cond_resched stuff doesn't worry me too much at a
fundamental level -- if we're really going quiescent, shouldn't we be
able to arrange that there are no other schedulable tasks on the CPU
in question?


We aren't currently planning to enforce things in the scheduler, so if
the application affinitizes another task on top of an existing task
isolation task, by default the task isolation task just dies. (Unless
it's using NOSIG mode, in which case it just ends up stuck in the
kernel trying to wait out the dyntick until you either kill it, or
re-affinitize the offending task.)  But I'm reluctant to guarantee
every possible way that you might (perhaps briefly) have some
schedulable task, and the current approach seems pretty robust if that
sort of thing happens.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


Re: [PATCH V4] leds: trigger: Introduce an USB port trigger

2016-08-30 Thread Rafał Miłecki
On 30 August 2016 at 14:05, Greg KH  wrote:
> On Fri, Aug 26, 2016 at 05:38:05PM +0200, Rafał Miłecki wrote:
>> On 25 August 2016 at 14:49, Greg KH  wrote:
>> > On Thu, Aug 25, 2016 at 10:03:52AM +0200, Rafał Miłecki wrote:
>> >> +static void usbport_trig_activate(struct led_classdev *led_cdev)
>> >> +{
>> >> + struct usbport_trig_data *usbport_data;
>> >> + int err;
>> >> +
>> >> + usbport_data = kzalloc(sizeof(*usbport_data), GFP_KERNEL);
>> >> + if (!usbport_data)
>> >> + return;
>> >> + usbport_data->led_cdev = led_cdev;
>> >> +
>> >> + /* Storing ports */
>> >> + INIT_LIST_HEAD(&usbport_data->ports);
>> >> + usbport_data->ports_dir = kobject_create_and_add("ports",
>> >> +  
>> >> &led_cdev->dev->kobj);
>> >
>> > If you _ever_ find yourself in a driver having to use kobject calls,
>> > it's a HUGE hint that you are doing something wrong.
>> >
>> > Hint, you are doing this wrong :)
>> >
>> > Use an attribute group if you need a subdirectory in sysfs, using a
>> > "raw" kobject like this hides it from all userspace tools and so no
>> > userspace program can see it (hint, try using libudev to access these
>> > files attached to the device...)
>>
>> Oops. Thanks for pointing groups to me. I was looking at sysfs.h
>> earlier but I didn't realize group can be a subdirectory. I can see
>> these sysfs_create_group(s) and friends now, thanks.
>
> No, use an attribute group, and name it, and the subdir will be created
> automagically for you.

Oh, so I got it wrong again. Thanks for explaining.


>> >> + if (!usbport_data->ports_dir)
>> >> + goto err_free;
>> >> +
>> >> + /* API for ports management */
>> >> + err = device_create_file(led_cdev->dev, &dev_attr_new_port);
>> >> + if (err)
>> >> + goto err_put_ports;
>> >> + err = device_create_file(led_cdev->dev, &dev_attr_remove_port);
>> >> + if (err)
>> >> + goto err_remove_new_port;
>> >
>> > Doesn't this race with userspace and fail?  Shouldn't the led core be
>> > creating your leds for you?
>>
>> These questions aren't clear to me. What kind of race? Doing
>> echo usbport > trigger
>> results in trigger core calling usbport_trig_activate. We create new
>> attributes and then we return.
>
> Why aren't the attributes present when the device is found?  What is
> "trigger" for?
>
>> I'm also not creating any leds there. This already has to be LED
>> available if you want to assign some trigger to it.
>
> That sounds really odd...

Please take a look at Documentation to get some idea of LED triggers:
Documentation/leds/leds-class.txt

Basically a LED (/sys/class/leds/foo/) can be controlled with the
"brightness" sysfs file like this:
echo 0 > brightness
echo 5 > brightness
echo 255 > brightness
(most LEDs react only to 0 and non-0 values).

As you quite often need more complex LED management, there are
triggers, which were introduced in 2006 by c3bc9956ec52f ("[PATCH] LED:
add LED trigger tupport"). Some triggers are trivial and could be
implemented in userspace as well (e.g. "timer"). Some had to be
implemented in kernelspace (CPU activity, MTD activity, etc.). With a
few triggers compiled in, you can assign them to LEDs as it pleases you.
Your hardware may have a generic (unlabeled) LED and you can
dynamically assign various triggers to it, depending e.g. on user
actions. E.g. if the user (using a GUI or whatever) wants to see flash
activity, your userspace script should do:
echo mtd > /sys/class/leds/foo/trigger

I hope this explains things a bit and you can now see why the trigger code
is independent of creating LEDs. It's about layers and making things
generic, I believe.

There is a place for a LED driver that is responsible for creating a
"struct led_classdev" with a proper callback for setting brightness. It
provides the LED name, but is unaware of the way it should be
lit/turned off.
There is the LED subsystem.
There are triggers that are unaware of LED hardware details and only
set LED brightness using the generic API.

I'm obviously not the author of all of that and I can't explain some
design decisions, but I hope I gave you a clue about how it works.
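
To make the split concrete, a trigger in its most minimal form looks
roughly like this (purely illustrative, names made up, error handling
trimmed) -- note it only ever talks to the generic LED API and never to
the hardware:

  #include <linux/leds.h>
  #include <linux/module.h>

  static void example_trig_activate(struct led_classdev *led_cdev)
  {
          /* called when userspace does: echo example > .../trigger */
          led_set_brightness(led_cdev, LED_FULL);
  }

  static void example_trig_deactivate(struct led_classdev *led_cdev)
  {
          led_set_brightness(led_cdev, LED_OFF);
  }

  static struct led_trigger example_trigger = {
          .name       = "example",
          .activate   = example_trig_activate,
          .deactivate = example_trig_deactivate,
  };

  static int __init example_trig_init(void)
  {
          return led_trigger_register(&example_trigger);
  }
  module_init(example_trig_init);

  static void __exit example_trig_exit(void)
  {
          led_trigger_unregister(&example_trigger);
  }
  module_exit(example_trig_exit);

  MODULE_LICENSE("GPL");

Which GPIO or register actually drives the LED stays hidden behind
led_set_brightness(), inside the LED driver.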


>> >> +
>> >> + /* Notifications */
>> >> + usbport_data->nb.notifier_call = usbport_trig_notify,
>> >> + led_cdev->trigger_data = usbport_data;
>> >> + usb_register_notify(&usbport_data->nb);
>> >
>> > Don't abuse the USB notifier chain with stuff like this please, is that
>> > really necessary?  Why can't your hardware just tell you what USB ports
>> > are in use "out of band"?
>>
>> I totally don't understand this part. This LED trigger is supposed to
>> react to USB devices appearing at specified ports. How else can I know
>> there is a new USB device if not by notifications?
>> I'm wondering if we are on the same page. Could it be my idea of this
>> LED trigger is not clear at all? Could you take a look at commit
>> message again, please? Mostly this part:
>> > This commit adds

Re: [PATCH V4] leds: trigger: Introduce an USB port trigger

2016-08-30 Thread Alan Stern
On Tue, 30 Aug 2016, Rafał Miłecki wrote:

> Please take a look at Documentation to get some idea of LED triggers:
> Documentation/leds/leds-class.txt
> 
> Basically a LED (/sys/class/leds/foo/) can be controlled with
> "brightness" sysfs file like this:
> echo 0 > brightness
> echo 5 > brightness
> echo 255 > brightness
> (most LEDs react only to 0 and non-0 values).
> 
> As you quite often need more complex LED management, there are
> triggers that were introduced in 2006 by c3bc9956ec52f ("[PATCH] LED:
> add LED trigger tupport"). Some triggers are trivial and could be
> implemented in userspace as well (e.g. "timer"). Some had to be
> implemented in kernelspace (CPU activity, MTD activity, etc.). Having
> a few triggers compiled, you can assign them to LEDs as it pleases you.
> Your hardware may have generic LED (not labeled) and you can
> dynamically assign various triggers to it, depending e.g. on user
> actions. E.g. if user (using GUI or whatever) wants to see flash
> activity, your userspace script should do:
> echo mtd > /sys/class/leds/foo/trigger

So for example, you might want to do:

echo usb1-4 >/sys/class/leds/foo/trigger

and then have the "foo" LED toggle whenever an URB was submitted or 
completed for a device attached to the 1-4 port.  Right?

> I hope it explains things a bit and you can see now why trigger code
> is independent of creating LEDs. It's about layers and making things
> generic I believe.
> 
> There is a place for LED driver that is responsible for creating
> "struct led_classdev" with proper callback setting brightness. It
> provides LED name, but is unaware of the way it should be
> lighted/turned off.
> There is LED subsystem.
> There are triggers that are unaware of LED hardware details and only
> set LED brightness using generic API.
> 
> I'm obviously not author of all of that and I can't explain some
> design decisions, but I hope I gave you a clue how it works.

> >> >> + /* Notifications */
> >> >> + usbport_data->nb.notifier_call = usbport_trig_notify,
> >> >> + led_cdev->trigger_data = usbport_data;
> >> >> + usb_register_notify(&usbport_data->nb);
> >> >
> >> > Don't abuse the USB notifier chain with stuff like this please, is that
> >> > really necessary?  Why can't your hardware just tell you what USB ports
> >> > are in use "out of band"?
> >>
> >> I totally don't understand this part. This LED trigger is supposed to
> >> react to USB devices appearing at specified ports. How else can I know
> >> there is a new USB device if not by notifications?
> >> I'm wondering if we are on the same page. Could it be my idea of this
> >> LED trigger is not clear at all? Could you take a look at commit
> >> message again, please? Mostly this part:
> >> > This commit adds a new trigger responsible for turning on LED when USB
> >> > device gets connected to the specified USB port. This can be useful for
> >> > various home routers that have USB port(s) and a proper LED telling user
> >> > a device is connected.
> >>
> >> Can I add something more to make it clear what this trigger is responsible 
> >> for?
> >
> > Yes please, as I totally missed that.
> >
> > And the USB notifier was created for some "special" parts of the USB
> > core to use, this feels like an abuse of that in some way.  But it's
> > hard to define, I know...
> 
> I didn't mean to abuse this USB notifier, can you think of any other
> way to achieve the same result? We could think about modifying USB
> core to call some symbol from the trigger driver (see usage of
> ledtrig_mtd_activity symbol) but is it any better than using
> notification block?

I don't think this is such a bad use of the USB notifier.  All you need 
to know is when a device is attached to or unplugged from an LED's 
port.  Neither of these is a very frequent event.
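
(For readers following along, the hookup under discussion has roughly this
shape -- simplified, with the details assumed rather than copied from the
patch:)

  #include <linux/usb.h>
  #include <linux/notifier.h>

  static int usbport_trig_notify(struct notifier_block *nb,
                                 unsigned long action, void *data)
  {
          struct usb_device *usb_dev = data;

          switch (action) {
          case USB_DEVICE_ADD:
                  /* if usb_dev sits on one of the observed ports,
                   * turn the LED on */
                  dev_dbg(&usb_dev->dev, "device added\n");
                  return NOTIFY_OK;
          case USB_DEVICE_REMOVE:
                  /* if it was the last device on an observed port,
                   * turn the LED off */
                  return NOTIFY_OK;
          }
          return NOTIFY_DONE;
  }

  /* registered once per LED from activate():
   *   usb_register_notify(&usbport_data->nb);
   */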

However, there is still the question of how to know when an URB is 
submitted or completed for the device in question.  Obviously the USB 
core would have to call an LED routine.  But how would you determine 
which LED(s) to toggle?  Go through the entire list, looking for ones 
that are bound to the USB device in question?  This seems inefficient.  
Use a hash table?

Alan Stern



Re: [PATCH V4] leds: trigger: Introduce an USB port trigger

2016-08-30 Thread Rafał Miłecki
On 30 August 2016 at 22:54, Alan Stern  wrote:
> On Tue, 30 Aug 2016, Rafał Miłecki wrote:
>
>> Please take a look at Documentation to get some idea of LED triggers:
>> Documentation/leds/leds-class.txt
>>
>> Basically a LED (/sys/class/leds/foo/) can be controlled with
>> "brightness" sysfs file like this:
>> echo 0 > brightness
>> echo 5 > brightness
>> echo 255 > brightness
>> (most LEDs react only to 0 and non-0 values).
>>
>> As you quite often need more complex LED management, there are
>> triggers that were introduced in 2006 by c3bc9956ec52f ("[PATCH] LED:
>> add LED trigger tupport"). Some triggers are trivial and could be
>> implemented in userspace as well (e.g. "timer"). Some had to be
>> implemented in kernelspace (CPU activity, MTD activity, etc.). Having
>> a few triggers compiled, you can assign them to LEDs as it pleases you.
>> Your hardware may have generic LED (not labeled) and you can
>> dynamically assign various triggers to it, depending e.g. on user
>> actions. E.g. if user (using GUI or whatever) wants to see flash
>> activity, your userspace script should do:
>> echo mtd > /sys/class/leds/foo/trigger
>
> So for example, you might want to do:
>
> echo usb1-4 >/sys/class/leds/foo/trigger
>
> and then have the "foo" LED toggle whenever an URB was submitted or
> completed for a device attached to the 1-4 port.  Right?

Not really, as it won't cover some pretty common use cases. Many home
routers have a few USB ports (2-5) and only one USB LED. It has to be
possible to assign a few USB ports to a single LED (trigger). That way
the LED should be turned on (and kept on) if there is at least one USB
device connected. You obviously can't do:
echo "usb1-1 usb1-2 usb2-1" > /sys/class/leds/foo/trigger

This was already brought up by Rob (who mentioned the CPU trigger) and I
replied to him in pretty much the same way in:
https://lkml.org/lkml/2016/7/29/38
(reply starts with "Anyway, the serious limitation I see").


>> I hope it explains things a bit and you can see now why trigger code
>> is independent of creating LEDs. It's about layers and making things
>> generic I believe.
>>
>> There is a place for LED driver that is responsible for creating
>> "struct led_classdev" with proper callback setting brightness. It
>> provides LED name, but is unaware of the way it should be
>> lighted/turned off.
>> There is LED subsystem.
>> There are triggers that are unaware of LED hardware details and only
>> set LED brightness using generic API.
>>
>> I'm obviously not author of all of that and I can't explain some
>> design decisions, but I hope I gave you a clue how it works.
>
>> >> >> + /* Notifications */
>> >> >> + usbport_data->nb.notifier_call = usbport_trig_notify,
>> >> >> + led_cdev->trigger_data = usbport_data;
>> >> >> + usb_register_notify(&usbport_data->nb);
>> >> >
>> >> > Don't abuse the USB notifier chain with stuff like this please, is that
>> >> > really necessary?  Why can't your hardware just tell you what USB ports
>> >> > are in use "out of band"?
>> >>
>> >> I totally don't understand this part. This LED trigger is supposed to
>> >> react to USB devices appearing at specified ports. How else can I know
>> >> there is a new USB device if not by notifications?
>> >> I'm wondering if we are on the same page. Could it be my idea of this
>> >> LED trigger is not clear at all? Could you take a look at commit
>> >> message again, please? Mostly this part:
>> >> > This commit adds a new trigger responsible for turning on LED when USB
>> >> >> > device gets connected to the specified USB port. This can be useful for
>> >> > various home routers that have USB port(s) and a proper LED telling user
>> >> > a device is connected.
>> >>
>> >> Can I add something more to make it clear what this trigger is 
>> >> responsible for?
>> >
>> > Yes please, as I totally missed that.
>> >
>> > And the USB notifier was created for some "special" parts of the USB
>> > core to use, this feels like an abuse of that in some way.  But it's
>> > hard to define, I know...
>>
>> I didn't mean to abuse this USB notifier, can you think of any other
>> way to achieve the same result? We could think about modifying USB
>> core to call some symbol from the trigger driver (see usage of
>> ledtrig_mtd_activity symbol) but is it any better than using
>> notification block?
>
> I don't think this is such a bad use of the USB notifier.  All you need
> to know is when a device is attached to or unplugged from an LED's
> port.  Neither of these is a very frequent event.
>
> However, there is still the question of how to know when an URB is
> submitted or completed for the device in question.  Obviously the USB
> core would have to call an LED routine.  But how would you determine
> which LED(s) to toggle?  Go through the entire list, looking for ones
> that are bound to the USB device in question?  This seems inefficient.
> Use a hash table?

I was hoping to bring it to discussion later, as it seems even the
basic version of t

Re: [PATCH v14 04/14] task_isolation: add initial support

2016-08-30 Thread Chris Metcalf

On 8/30/2016 1:10 PM, Frederic Weisbecker wrote:

On Tue, Aug 30, 2016 at 11:41:36AM -0400, Chris Metcalf wrote:

On 8/29/2016 8:55 PM, Frederic Weisbecker wrote:

On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote:

On 8/11/2016 2:50 PM, Christoph Lameter wrote:

On Thu, 11 Aug 2016, Frederic Weisbecker wrote:


Do we need to quiesce vmstat every time before entering userspace?
I thought that vmstat only need to be offlined once and for all?

Once is sufficient after disabling the tick.

It's true that task_isolation_enter() is called every time before
returning to user space while task isolation is enabled.

But once we enter the kernel again after returning from the initial
prctl() -- assuming we are in NOSIG mode so doing so is legal in the
first place -- almost anything can happen, certainly including
restarting the tick.  Thus, we have to make sure that normal quiescing
happens again before we return to userspace.

Yes but we need to sort out what needs to be called only once on prctl().

Once vmstat is quiesced, it's not going to need quiescing again even if we
restart the tick.

That's true, but I really do like the idea of having a clean structure
where we verify all our prerequisites in task_isolation_ready(), and
have code to try to get things fixed up in task_isolation_enter().
If we start moving some things here and some things there, it gets
harder to manage.  I think by testing "!vmstat_idle()" in
task_isolation_enter() we are avoiding any substantial overhead.

I think that making the code clearer on what needs to be done once and for
all on prctl() and what needs to be done on every actual syscall return
is more important for readability.


We don't need to just do it on prctl(), though.  We also need to do
it whenever we have been in the kernel for another reason, which
can happen with NOSIG.  So we need to do this on the common return
to userspace path no matter what, right?  Or am I missing something?
It seems like if, for example, we do mmaps or page faults, then on return
to userspace, some of those counters will have been incremented and
we'll need to run the quiet_vmstat_sync() code.
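
(A rough sketch of that structure, for illustration only -- not the
literal patch code:)

  /* late in the prepare-exit-to-usermode path */
  while (!task_isolation_ready()) {
          /* try to fix up whatever is not quiet yet, e.g. run
           * quiet_vmstat_sync() and ask for the tick to be stopped */
          task_isolation_enter();
  }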


+   if (!tick_nohz_tick_stopped())
+   set_tsk_need_resched(current);
Again, that won't help

It won't be better than spinning in a loop if there aren't any other
schedulable processes, but it won't be worse either.  If there is
another schedulable process, we at least will schedule it sooner than
if we just sat in a busy loop and waited for the scheduler to kick
us. But there's nothing else we can do anyway if we want to maintain
the guarantee that the dyn tick is stopped before return to userspace.

I don't think it helps either way. If reschedule is pending, the current
task already has TIF_RESCHED set.

See the other thread with Peter Z for the longer discussion of this.
At this point I'm leaning towards replacing the set_tsk_need_resched() with

 set_current_state(TASK_INTERRUPTIBLE);
 schedule();
 __set_current_state(TASK_RUNNING);

I don't see how that helps. What will wake the thread up except a signal?


The answer is that the scheduler will keep bringing us back to this
point (either after running another runnable task if there is one,
or else just returning to this point immediately without doing a
context switch), and we will then go around the "prepare exit to
userspace" loop and perhaps discover that enough time has gone
by that the last dyntick interrupt has triggered and the kernel has
quiesced the dynticks.  At that point we stop calling schedule()
over and over and can return normally to userspace.
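
(Sketched out, and assuming the task stays runnable so that schedule()
actually returns -- signal handling and the rest of the exit work omitted:)

  while (!tick_nohz_tick_stopped()) {
          /* runs another runnable task if there is one, otherwise
           * returns immediately; either way we re-check the tick */
          schedule();
  }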

It's very counter-intuitive to burn cpu time intentionally in the kernel.
I really don't see another way to resolve this, though.  I don't think
it would be safe, for example, to just promote the next dyntick to
running immediately (rather than waiting a few microseconds until
it is scheduled to go off).

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[PATCH 1/3] docs-rst: improve typedef parser

2016-08-30 Thread Mauro Carvalho Chehab
Improve the parser to handle typedefs like:

typedef bool v4l2_check_dv_timings_fnc(const struct v4l2_dv_timings *t, void *handle);
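
(For context, such a typedef is normally documented with a kernel-doc
comment of this shape; the example is illustrative and abbreviated from
the real header:)

/**
 * typedef v4l2_check_dv_timings_fnc - timings check callback
 * @t: the v4l2_dv_timings struct to check
 * @handle: an opaque handle passed through by the caller
 *
 * Returns true if the given timings are considered valid.
 */
typedef bool v4l2_check_dv_timings_fnc(const struct v4l2_dv_timings *t,
                                       void *handle);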

Signed-off-by: Mauro Carvalho Chehab 
---
 scripts/kernel-doc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/scripts/kernel-doc b/scripts/kernel-doc
index bac0af4fc659..d94870270d8e 100755
--- a/scripts/kernel-doc
+++ b/scripts/kernel-doc
@@ -2191,7 +2191,9 @@ sub dump_typedef($$) {
 $x =~ s@/\*.*?\*/@@gos;# strip comments.
 
 # Parse function prototypes
-if ($x =~ /typedef\s+(\w+)\s*\(\*\s*(\w\S+)\s*\)\s*\((.*)\);/) {
+if ($x =~ /typedef\s+(\w+)\s*\(\*\s*(\w\S+)\s*\)\s*\((.*)\);/ ||
+   $x =~ /typedef\s+(\w+)\s*(\w\S+)\s*\s*\((.*)\);/) {
+
# Function typedefs
$return_type = $1;
$declaration_name = $2;
-- 
2.7.4




[PATCH 0/3] Fix kernel-doc parser for typedef functions

2016-08-30 Thread Mauro Carvalho Chehab
The kernel-doc parser has two issues when handling typedef functions:

1) its parsing is incomplete;
2) it causes warnings when a typedef is used as a function argument.

This series partially addresses (1), adding one extra syntax for the parser.
I'm pretty sure that the parser is still incomplete and that we'll find
other places where it fails.

Jon,

My plan is to apply the last patch to my tree, together with a series of
patches that I'm writing to fix the warnings in nitpick mode.

The other two patches fit better in your tree, IMHO.

Mauro Carvalho Chehab (3):
  docs-rst: improve typedef parser
  docs-rst: kernel-doc: fix typedef output in RST format
  [media] v4l2-dv-timings.h: let kernel-doc parte the typedef argument

 include/media/v4l2-dv-timings.h |  4 ++--
 scripts/kernel-doc  | 36 ++--
 2 files changed, 28 insertions(+), 12 deletions(-)

-- 
2.7.4




[PATCH 2/3] docs-rst: kernel-doc: fix typedef output in RST format

2016-08-30 Thread Mauro Carvalho Chehab
When using a typedef function like this one:
typedef bool v4l2_check_dv_timings_fnc (const struct v4l2_dv_timings * t, void * handle);

The Sphinx C domain expects it to create a c:type: reference,
as that's the way it creates the type references when parsing
a c:function:: declaration.

So, a declaration like:

.. c:function:: bool v4l2_valid_dv_timings (const struct v4l2_dv_timings * t, const struct v4l2_dv_timings_cap * cap, v4l2_check_dv_timings_fnc fnc, void * fnc_handle)

Will create a cross reference for :c:type:`v4l2_check_dv_timings_fnc`.

So, when outputting such typedefs in RST format, we need to handle
this special case, as otherwise it will produce those warnings:

./include/media/v4l2-dv-timings.h:43: WARNING: c:type reference target not found: v4l2_check_dv_timings_fnc
./include/media/v4l2-dv-timings.h:60: WARNING: c:type reference target not found: v4l2_check_dv_timings_fnc
./include/media/v4l2-dv-timings.h:81: WARNING: c:type reference target not found: v4l2_check_dv_timings_fnc

So, change the kernel-doc script to produce RST output for the
above typedef as:
.. c:type:: v4l2_check_dv_timings_fnc

   **Typedef**: timings check callback

**Syntax**

  ``bool v4l2_check_dv_timings_fnc (const struct v4l2_dv_timings * t, void * handle);``

Signed-off-by: Mauro Carvalho Chehab 
---
 scripts/kernel-doc | 32 +++-
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/scripts/kernel-doc b/scripts/kernel-doc
index d94870270d8e..091e49167b44 100755
--- a/scripts/kernel-doc
+++ b/scripts/kernel-doc
@@ -1831,13 +1831,22 @@ sub output_function_rst(%) {
 my %args = %{$_[0]};
 my ($parameter, $section);
 my $oldprefix = $lineprefix;
-my $start;
+my $start = "";
 
-print ".. c:function:: ";
+if ($args{'typedef'}) {
+   print ".. c:type:: ". $args{'function'} . "\n\n";
+   print_lineno($declaration_start_line);
+   print "   **Typedef**: ";
+   $lineprefix = "";
+   output_highlight_rst($args{'purpose'});
+   $start = "\n\n**Syntax**\n\n  ``";
+} else {
+   print ".. c:function:: ";
+}
 if ($args{'functiontype'} ne "") {
-   $start = $args{'functiontype'} . " " . $args{'function'} . " (";
+   $start .= $args{'functiontype'} . " " . $args{'function'} . " (";
 } else {
-   $start = $args{'function'} . " (";
+   $start .= $args{'function'} . " (";
 }
 print $start;
 
@@ -1857,11 +1866,15 @@ sub output_function_rst(%) {
$count++;
}
 }
-print ")\n\n";
-print_lineno($declaration_start_line);
-$lineprefix = "   ";
-output_highlight_rst($args{'purpose'});
-print "\n";
+if ($args{'typedef'}) {
+   print ");``\n\n";
+} else {
+   print ")\n\n";
+   print_lineno($declaration_start_line);
+   $lineprefix = "   ";
+   output_highlight_rst($args{'purpose'});
+   print "\n";
+}
 
 print "**Parameters**\n\n";
 $lineprefix = "  ";
@@ -2204,6 +2217,7 @@ sub dump_typedef($$) {
output_declaration($declaration_name,
   'function',
   {'function' => $declaration_name,
+   'typedef' => 1,
'module' => $modulename,
'functiontype' => $return_type,
'parameterlist' => \@parameterlist,
-- 
2.7.4

