Re: [PATCH v2 1/1] x86/acpi: Use FADT flags to determine the PMTMR width

2020-06-22 Thread Jan Beulich
On 19.06.2020 20:23, Grzegorz Uriasz wrote:
> On 19/06/2020 15:17, Jan Beulich wrote:
>> On 19.06.2020 06:28, Grzegorz Uriasz wrote:
>>> --- a/xen/arch/x86/time.c
>>> +++ b/xen/arch/x86/time.c
>>> @@ -457,16 +457,16 @@ static u64 read_pmtimer_count(void)
>>>  static s64 __init init_pmtimer(struct platform_timesource *pts)
>>>  {
>>>  u64 start;
>>> -u32 count, target, mask = 0xffffff;
>>> +u32 count, target, mask;
>>>  
>>> -if ( !pmtmr_ioport || !pmtmr_width )
>>> +if ( !pmtmr_ioport )
>>>  return 0;
>>>  
>>> -if ( pmtmr_width == 32 )
>>> -{
>>> -pts->counter_bits = 32;
>>> -mask = 0xffffffff;
>>> -}
>>> +if ( pmtmr_width != 24 && pmtmr_width != 32 )
>>> +return 0;
>>> +
>>> +pts->counter_bits = (int) pmtmr_width;
>>> +mask = (1ull << pmtmr_width) - 1;
>>>  
>>>  count = inl(pmtmr_ioport) & mask;
>>>  start = rdtsc_ordered();
>>> @@ -486,7 +486,6 @@ static struct platform_timesource __initdata 
>>> plt_pmtimer =
>>>  .name = "ACPI PM Timer",
>>>  .frequency = ACPI_PM_FREQUENCY,
>>>  .read_counter = read_pmtimer_count,
>>> -.counter_bits = 24,
>>>  .init = init_pmtimer
>>>  };
>> I'm struggling a little to see why this code churn is needed / wanted.
>> The change I had suggested was quite a bit less intrusive. I'm not
>> entirely opposed though, but at the very least please drop the odd
>> (int) cast. If anything we want the struct field changed to unsigned
>> int (but unlikely in this very patch).
>>
>> If you want to stick to this larger change, then also please fold the
>> two if()s at the beginning of the function.
> 
> I wanted to avoid hard coding the masks -> Linux has a nice macro for
> generating the masks but I haven't found a similar macro in xen.
> Additionally this version sets the counter width in a single place
> instead of two.

I guessed this was the goal, but I think such adjustments, if indeed
wanted, would better go into a separate patch. The bug fix here wants
backporting, while this extra cleanup probably doesn't.

Jan



Re: [PATCH v2 3/7] x86/vmx: add IPT cpu feature

2020-06-22 Thread Jan Beulich
On 22.06.2020 04:49, Michał Leszczyński wrote:
> - On Jun 19, 2020, at 15:44, Roger Pau Monné roger@citrix.com wrote:
> 
>> On Fri, Jun 19, 2020 at 01:40:21AM +0200, Michał Leszczyński wrote:
>>> Check if Intel Processor Trace feature is supported by current
>>> processor. Define hvm_ipt_supported function.
>>>
>>> Signed-off-by: Michal Leszczynski 
>>> ---
>>
>> We usually keep a short list of the changes between versions, so it's
>> easier for the reviewers to know what changed. As an example:
>>
>> https://lore.kernel.org/xen-devel/20200613184132.11880-1-jul...@xen.org/
>>
>>>  xen/arch/x86/hvm/vmx/vmcs.c | 4 
>>>  xen/include/asm-x86/cpufeature.h| 1 +
>>>  xen/include/asm-x86/hvm/hvm.h   | 9 +
>>>  xen/include/asm-x86/hvm/vmx/vmcs.h  | 1 +
>>>  xen/include/public/arch-x86/cpufeatureset.h | 1 +
>>>  5 files changed, 16 insertions(+)
>>>
>>> diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
>>> index ca94c2bedc..8466ccb912 100644
>>> --- a/xen/arch/x86/hvm/vmx/vmcs.c
>>> +++ b/xen/arch/x86/hvm/vmx/vmcs.c
>>> @@ -315,6 +315,10 @@ static int vmx_init_vmcs_config(void)
>>>  if ( opt_ept_pml )
>>>  opt |= SECONDARY_EXEC_ENABLE_PML;
>>>  
>>> +/* Check whether IPT is supported in VMX operation */
>>> +hvm_funcs.pt_supported = cpu_has_ipt &&
>>> +( _vmx_misc_cap & VMX_MISC_PT_SUPPORTED );
>>
>> By the placement of this chunk you are tying IPT support to the
>> secondary exec availability, but I don't think that's required?
>>
>> Ie: You should move the read of misc_cap to the top-level of the
>> function and perform the VMX_MISC_PT_SUPPORTED check there also.
>>
>> Note that space inside parentheses is only required for conditions of
>> 'if', 'for' and those kind of statements, here it's not required, so
>> this should be:
>>
>>hvm_funcs.pt_supported = cpu_has_ipt &&
>> (_vmx_misc_cap & VMX_MISC_PT_SUPPORTED);
>>
>> I also think this should look like:
>>
>>if ( !smp_processor_id() )
>>  hvm_funcs.pt_supported = cpu_has_ipt &&
>> (_vmx_misc_cap & VMX_MISC_PT_SUPPORTED);
>>else if ( hvm_funcs.pt_supported &&
>>  !(_vmx_misc_cap & VMX_MISC_PT_SUPPORTED) )
>>{
>>printk("VMX: IPT capabilities fatally differ between CPU%u and 
>> CPU0\n",
>>   smp_processor_id());
>>return -EINVAL;
>>}
>>
>>
>> So that you can detect mismatches between CPUs.
> 
> 
> I'm afraid this snippet doesn't work. All CPUs read hvm_funcs.pt_supported as 
> 0 even when it was set to 1 for CPU=0. I'm not sure if this is some 
> multithreading issue or there is a separate hvm_funcs for each CPU?

hvm_funcs will be set from start_vmx()'s return value, i.e. vmx_function_table.
Therefore at least prior to start_vmx() completing you need to fiddle with
vmx_function_table, not hvm_funcs. And in the code here, where for APs you
only read the value, you could easily use the former uniformly, I think.

Jan



Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

2020-06-22 Thread Roger Pau Monné
On Fri, Jun 19, 2020 at 11:43:12PM +, Anchal Agarwal wrote:
> On Wed, Jun 17, 2020 at 10:35:28AM +0200, Roger Pau Monné wrote:
> > CAUTION: This email originated from outside of the organization. Do not 
> > click links or open attachments unless you can confirm the sender and know 
> > the content is safe.
> > 
> > 
> > 
> > On Tue, Jun 16, 2020 at 09:49:25PM +, Anchal Agarwal wrote:
> > > On Thu, Jun 04, 2020 at 09:05:48AM +0200, Roger Pau Monné wrote:
> > > > On Wed, Jun 03, 2020 at 11:33:52PM +, Agarwal, Anchal wrote:
> > > > > > + xenbus_dev_error(dev, err, "Freezing timed out;"
> > > > > > +  "the device may become 
> > > > > inconsistent state");
> > > > >
> > > > > Leaving the device in this state is quite bad, as it's in a closed
> > > > > state and with the queues frozen. You should make an attempt to
> > > > > restore things to a working state.
> > > > >
> > > > > You mean if backend closed after timeout? Is there a way to know 
> > > > > that? I understand it's not good to
> > > > > leave it in this state however, I am still trying to find if there is 
> > > > > a good way to know if backend is still connected after timeout.
> > > > > Hence the message " the device may become inconsistent state".  I 
> > > > > didn't see a timeout not even once on my end so that's why
> > > > > I may be looking for an alternate perspective here. may be need to 
> > > > > thaw everything back intentionally is one thing I could think of.
> > > >
> > > > You can manually force this state, and then check that it will behave
> > > > correctly. I would expect that on a failure to disconnect from the
> > > > backend you should switch the frontend to the 'Init' state in order to
> > > > try to reconnect to the backend when possible.
> > > >
> > > From what I understand forcing manually is, failing the freeze without
> > > disconnect and try to revive the connection by unfreezing the
> > > queues->reconnecting to backend [which never got disconnected]. Maybe even
> > > tearing down things manually because I am not sure what state will 
> > > frontend
> > > see if backend fails to to disconnect at any point in time. I assumed 
> > > connected.
> > > Then again if its "CONNECTED" I may not need to tear down everything and 
> > > start
> > > from Initialising state because that may not work.
> > >
> > > So I am not so sure about backend's state, let's say if
> > > xen_blkif_disconnect fails,
> > > I don't see it getting handled in the backend then what will be backend's 
> > > state?
> > > Will it still switch xenbus state to 'Closed'? If not what will frontend 
> > > see,
> > > if it tries to read backend's state through xenbus_read_driver_state ?
> > >
> > > So the flow be like:
> > > Front end marks XenbusStateClosing
> > > Backend marks its state as XenbusStateClosing
> > > Frontend marks XenbusStateClosed
> > > Backend disconnects calls xen_blkif_disconnect
> > >Backend fails to disconnect, the above function returns EBUSY
> > >What will be state of backend here?
> > 
> > Backend should stay in state 'Closing' then, until it can finish
> > tearing down.
> > 
> It disconnects the ring after switching to connected state too. 
> > >Frontend did not tear down the rings if backend does not switches 
> > > the
> > >state to 'Closed' in case of failure.
> > >
> > > If backend stays in CONNECTED state, then even if we mark it Initialised 
> > > in frontend, backend
> > 
> > Backend will stay in state 'Closing' I think.
> > 
> > > won't be calling connect(). {From reading code in frontend_changed}
> > > IMU, Initialising will fail since backend dev->state != XenbusStateClosed 
> > > plus
> > > we did not tear down anything so calling talk_to_blkback may not be needed
> > >
> > > Does that sound correct?
> > 
> > I think switching to the initial state in order to try to attempt a
> > reconnection would be our best bet here.
> >
> It does not seem to work correctly, I get hung tasks all over and all the
> requests to filesystem gets stuck. Backend does shows the state as connected
> after xenbus_dev_suspend fails but I think there may be something missing.
> I don't seem to get IO interrupts thereafter, i.e. hitting the function
> blkif_interrupts.
> I think just marking it initialised may not be the only thing.
> Here is a short description of what I am trying to do:
> So, on timeout:
> Switch XenBusState to "Initialized"
> unquiesce/unfreeze the queues and return
> mark info->connected = BLKIF_STATE_CONNECTED

Re: [PATCH v2 4/7] x86/vmx: add do_vmtrace_op

2020-06-22 Thread Jan Beulich
On 22.06.2020 04:56, Michał Leszczyński wrote:
> - On Jun 19, 2020, at 1:41, Michał Leszczyński michal.leszczyn...@cert.pl wrote:
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -508,11 +508,25 @@ static void vmx_restore_host_msrs(void)
>>
>> static void vmx_save_guest_msrs(struct vcpu *v)
>> {
>> +uint64_t rtit_ctl;
>> +
>> /*
>>  * We cannot cache SHADOW_GS_BASE while the VCPU runs, as it can
>>  * be updated at any time via SWAPGS, which we cannot trap.
>>  */
>> v->arch.hvm.vmx.shadow_gs = rdgsshadow();
>> +
>> +if ( unlikely(v->arch.hvm.vmx.ipt_state &&
>> v->arch.hvm.vmx.ipt_state->active) )
>> +{
>> +smp_rmb();
>> +rdmsrl(MSR_RTIT_CTL, rtit_ctl);
>> +
>> +if ( rtit_ctl & RTIT_CTL_TRACEEN )
>> +BUG();
>> +
>> +rdmsrl(MSR_RTIT_STATUS, v->arch.hvm.vmx.ipt_state->status);
>> +rdmsrl(MSR_RTIT_OUTPUT_MASK,
>> v->arch.hvm.vmx.ipt_state->output_mask.raw);
>> +}
>> }
> 
> 
> It was suggested to move this piece of code from vm-entry/vm-exit handling to
> vmx_save_guest_msrs/vmx_restore_guest_msrs functions.
> 
> Please note that we do need to periodically read MSR_RTIT_OUTPUT_MASK in order
> to know how much data was written into the buffer by the processor. During 
> tests,
> I've spotted that in some cases vCPUs get out of scope very rarely.
> 
> For instance: when IPT gets enabled, xc_vmtrace_ipt_get_offset() is returning 
> zero
> value for the first few seconds, because it's relying on the value of
> v->arch.hvm.vmx.ipt_state->output_mask which we only read in 
> vmx_save_guest_msrs()
> and in some cases this occurs very rarely.
> 
> Could you guys suggest some solution to the problem? If we would leave this 
> value
> dirty in hardware, AFAIK we have no possibility to directly access it,
> but observing this particular value is very important in the runtime.

Much depends on what state the vCPU in question is in when you need
to "periodically read" the value. If it's paused, you may need to
invoke sync_vcpu_execstate(). If it's actively running you can't
get at the value anyway except when you're on the CPU that this vCPU
is running on (and then you can RDMSR it quite easily). Any
intermediate state between paused and running is prone to change
immediately after you've established it anyway, so you need to get
the vCPU into one of the two "stable" states in order to get at
the register.

Also (and I think I said so before) - please trim your replies.

Jan



Re: [PATCH v3 1/1] x86/acpi: Use FADT flags to determine the PMTMR width

2020-06-22 Thread Jan Beulich
On 20.06.2020 00:00, Grzegorz Uriasz wrote:
> @@ -490,8 +497,12 @@ static int __init acpi_parse_fadt(struct 
> acpi_table_header *table)
>*/
>   if (!pmtmr_ioport) {
>   pmtmr_ioport = fadt->pm_timer_block;
> - pmtmr_width = fadt->pm_timer_length == 4 ? 24 : 0;
> + pmtmr_width = fadt->pm_timer_length == 4 ? 32 : 0;
>   }
> + if (pmtmr_width < 32 && fadt->flags & ACPI_FADT_32BIT_TIMER)
> +printk(KERN_WARNING PREFIX "PM-Timer is too short\n");

Indentation is screwed up here.

> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -457,16 +457,13 @@ static u64 read_pmtimer_count(void)
>  static s64 __init init_pmtimer(struct platform_timesource *pts)
>  {
>  u64 start;
> -u32 count, target, mask = 0xffffff;
> +u32 count, target, mask;
>  
> -if ( !pmtmr_ioport || !pmtmr_width )
> +if ( !pmtmr_ioport || (pmtmr_width != 24 && pmtmr_width != 32) )
>  return 0;
>  
> -if ( pmtmr_width == 32 )
> -{
> -pts->counter_bits = 32;
> -mask = 0xffffffff;
> -}
> +pts->counter_bits = pmtmr_width;
> +mask = (1ull << pmtmr_width) - 1;

"mask" being of "u32" type, the use of 1ull is certainly a little odd
here, and one of the reasons why I think this extra "cleanup" would
better go in a separate patch.

Seeing that besides cosmetic aspects the patch looks okay now, I'd be
willing to leave the bigger adjustment mostly as is, albeit I'd
prefer the 1ull to go away, by instead switching to e.g.

mask = 0xffffffff >> (32 - pmtmr_width);

With the adjustments (which I'd be happy to make while committing,
provided you agree)
Reviewed-by: Jan Beulich 

Also Cc-ing Paul for a possible release ack (considering this is a
backporting candidate).

Jan



Re: [PATCH v3 1/1] x86/acpi: Use FADT flags to determine the PMTMR width

2020-06-22 Thread Roger Pau Monné
On Fri, Jun 19, 2020 at 10:00:16PM +, Grzegorz Uriasz wrote:
> On some computers the bit width of the PM Timer as reported
> by ACPI is 32 bits when in fact the FADT flags report correctly
> that the timer is 24 bits wide. On affected machines such as the
> ASUS FX504GM and newer gaming laptops this results in the inability
> to resume the machine from suspend. Without this patch suspend is
> broken on affected machines and even if a machine manages to resume
> correctly then the kernel time and xen timers are trashed.
> 
> Signed-off-by: Grzegorz Uriasz 
> ---
> Cc: Jan Beulich 
> Cc: Andrew Cooper 
> Cc: Wei Liu 
> Cc: "Roger Pau Monné" 
> Cc: marma...@invisiblethingslab.com
> Cc: cont...@puzio.waw.pl
> Cc: ja...@bartmin.ski
> Cc: j.nowa...@student.uw.edu.pl
> 
> Changes in v2:
> - Check pm timer access width
> - Proper timer width is set when xpm block is not present
> - Cleanup timer initialization
> 
> Changes in v3:
> - Check pm timer bit offset
> - Warn about acpi firmware bugs
> - Drop int cast in init_pmtimer
> - Merge if's in init_pmtimer
> ---
>  xen/arch/x86/acpi/boot.c| 19 +++
>  xen/arch/x86/time.c | 12 
>  xen/include/acpi/acmacros.h |  8 
>  3 files changed, 27 insertions(+), 12 deletions(-)
> 
> diff --git a/xen/arch/x86/acpi/boot.c b/xen/arch/x86/acpi/boot.c
> index bcba52e232..24fd236eb5 100644
> --- a/xen/arch/x86/acpi/boot.c
> +++ b/xen/arch/x86/acpi/boot.c
> @@ -475,10 +475,17 @@ static int __init acpi_parse_fadt(struct 
> acpi_table_header *table)
>  
>  #ifdef CONFIG_X86_PM_TIMER
>   /* detect the location of the ACPI PM Timer */
> - if (fadt->header.revision >= FADT2_REVISION_ID) {
> + if (fadt->header.revision >= FADT2_REVISION_ID &&
> + fadt->xpm_timer_block.space_id == ACPI_ADR_SPACE_SYSTEM_IO) {
>   /* FADT rev. 2 */
> - if (fadt->xpm_timer_block.space_id ==
> - ACPI_ADR_SPACE_SYSTEM_IO) {
> + if (fadt->xpm_timer_block.access_width != 0 &&
> + ACPI_ACCESS_BIT_WIDTH(fadt->xpm_timer_block.access_width) 
> != 32)
> + printk(KERN_WARNING PREFIX "PM-Timer has invalid access 
> width(%u)\n",
> +fadt->xpm_timer_block.access_width);
> + else if (fadt->xpm_timer_block.bit_offset != 0)

This should be a plain if instead of an else if, it's possible the
block has both the access_width and the bit_offset wrong.

> + printk(KERN_WARNING PREFIX "PM-Timer has invalid bit 
> offset(%u)\n",
> +fadt->xpm_timer_block.bit_offset);
> + else {
>   pmtmr_ioport = fadt->xpm_timer_block.address;
>   pmtmr_width = fadt->xpm_timer_block.bit_width;
>   }
> @@ -490,8 +497,12 @@ static int __init acpi_parse_fadt(struct 
> acpi_table_header *table)
>*/
>   if (!pmtmr_ioport) {
>   pmtmr_ioport = fadt->pm_timer_block;
> - pmtmr_width = fadt->pm_timer_length == 4 ? 24 : 0;
> + pmtmr_width = fadt->pm_timer_length == 4 ? 32 : 0;
>   }
> + if (pmtmr_width < 32 && fadt->flags & ACPI_FADT_32BIT_TIMER)
> +printk(KERN_WARNING PREFIX "PM-Timer is too short\n");

Indentation. Also this should be a fatal error likely? You should set
pmtmr_ioport = 0?

> + if (pmtmr_width > 24 && !(fadt->flags & ACPI_FADT_32BIT_TIMER))
> + pmtmr_width = 24;
>   if (pmtmr_ioport)
>   printk(KERN_INFO PREFIX "PM-Timer IO Port: %#x (%u bits)\n",
>  pmtmr_ioport, pmtmr_width);
> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> index d643590c0a..8a45514908 100644
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -457,16 +457,13 @@ static u64 read_pmtimer_count(void)
>  static s64 __init init_pmtimer(struct platform_timesource *pts)
>  {
>  u64 start;
> -u32 count, target, mask = 0xffffff;
> +u32 count, target, mask;
>  
> -if ( !pmtmr_ioport || !pmtmr_width )
> +if ( !pmtmr_ioport || (pmtmr_width != 24 && pmtmr_width != 32) )
>  return 0;
>  
> -if ( pmtmr_width == 32 )
> -{
> -pts->counter_bits = 32;
> -mask = 0xffffffff;
> -}
> +pts->counter_bits = pmtmr_width;
> +mask = (1ull << pmtmr_width) - 1;
>  
>  count = inl(pmtmr_ioport) & mask;
>  start = rdtsc_ordered();
> @@ -486,7 +483,6 @@ static struct platform_timesource __initdata plt_pmtimer =
>  .name = "ACPI PM Timer",
>  .frequency = ACPI_PM_FREQUENCY,
>  .read_counter = read_pmtimer_count,
> -.counter_bits = 24,
>  .init = init_pmtimer
>  };
>  
> diff --git a/xen/include/acpi/acmacros.h b/xen/include/acpi/acmacros.h
> index 6765535053..86c503c20f 100644
> --- a/xen/include/acpi/acmacros.h
> +++ b/xen/include/acpi/acmacros.h
> @@ -121,6 +121,14 @@
>  #define ACPI_COMPARE_NAME(a,b)  (!ACPI_STRNCMP (ACPI_CAST_PTR 
> (char,(a)), ACPI_CAST_PTR (char,(b)

Re: [PATCH v3 1/1] x86/acpi: Use FADT flags to determine the PMTMR width

2020-06-22 Thread Roger Pau Monné
On Mon, Jun 22, 2020 at 10:54:00AM +0200, Roger Pau Monné wrote:
> On Fri, Jun 19, 2020 at 10:00:16PM +, Grzegorz Uriasz wrote:
> > On some computers the bit width of the PM Timer as reported
> > by ACPI is 32 bits when in fact the FADT flags report correctly
> > that the timer is 24 bits wide. On affected machines such as the
> > ASUS FX504GM and newer gaming laptops this results in the inability
> > to resume the machine from suspend. Without this patch suspend is
> > broken on affected machines and even if a machine manages to resume
> > correctly then the kernel time and xen timers are trashed.
> > 
> > Signed-off-by: Grzegorz Uriasz 
> > ---
> > Cc: Jan Beulich 
> > Cc: Andrew Cooper 
> > Cc: Wei Liu 
> > Cc: "Roger Pau Monné" 
> > Cc: marma...@invisiblethingslab.com
> > Cc: cont...@puzio.waw.pl
> > Cc: ja...@bartmin.ski
> > Cc: j.nowa...@student.uw.edu.pl
> > 
> > Changes in v2:
> > - Check pm timer access width
> > - Proper timer width is set when xpm block is not present
> > - Cleanup timer initialization
> > 
> > Changes in v3:
> > - Check pm timer bit offset
> > - Warn about acpi firmware bugs
> > - Drop int cast in init_pmtimer
> > - Merge if's in init_pmtimer
> > ---
> >  xen/arch/x86/acpi/boot.c| 19 +++
> >  xen/arch/x86/time.c | 12 
> >  xen/include/acpi/acmacros.h |  8 
> >  3 files changed, 27 insertions(+), 12 deletions(-)
> > 
> > diff --git a/xen/arch/x86/acpi/boot.c b/xen/arch/x86/acpi/boot.c
> > index bcba52e232..24fd236eb5 100644
> > --- a/xen/arch/x86/acpi/boot.c
> > +++ b/xen/arch/x86/acpi/boot.c
> > @@ -475,10 +475,17 @@ static int __init acpi_parse_fadt(struct 
> > acpi_table_header *table)
> >  
> >  #ifdef CONFIG_X86_PM_TIMER
> > /* detect the location of the ACPI PM Timer */
> > -   if (fadt->header.revision >= FADT2_REVISION_ID) {
> > +   if (fadt->header.revision >= FADT2_REVISION_ID &&
> > +   fadt->xpm_timer_block.space_id == ACPI_ADR_SPACE_SYSTEM_IO) {
> > /* FADT rev. 2 */
> > -   if (fadt->xpm_timer_block.space_id ==
> > -   ACPI_ADR_SPACE_SYSTEM_IO) {
> > +   if (fadt->xpm_timer_block.access_width != 0 &&
> > +   ACPI_ACCESS_BIT_WIDTH(fadt->xpm_timer_block.access_width) 
> > != 32)
> > +   printk(KERN_WARNING PREFIX "PM-Timer has invalid access 
> > width(%u)\n",
> > +  fadt->xpm_timer_block.access_width);
> > +   else if (fadt->xpm_timer_block.bit_offset != 0)
> 
> This should be a plain if instead of an else if, it's possible the
> block has both the access_width and the bit_offset wrong.

My bad, realized this avoids setting pmtmr_ioport.

Roger.



Re: [XEN PATCH for-4.14 1/2] tools: Commit autoconf (2.69) output from Debian buster

2020-06-22 Thread Olaf Hering
On Fri, Jun 12, Ian Jackson wrote:

> These files are in tree so that people can build (including from git)
> without needing recent autotools.

Do you know those people who cannot possibly install the required (now 8-year-old) autoconf version?

Olaf




Re: [PATCH for-4.14] x86/tlb: fix assisted flush usage

2020-06-22 Thread Roger Pau Monné
On Fri, Jun 19, 2020 at 04:06:55PM +0200, Jan Beulich wrote:
> On 18.06.2020 18:04, Roger Pau Monne wrote:
> > Commit e9aca9470ed86 introduced a regression when avoiding sending
> > IPIs for certain flush operations. Xen page fault handler
> > (spurious_page_fault) relies on blocking interrupts in order to
> > prevent handling TLB flush IPIs and thus preventing other CPUs from
> > removing page tables pages. Switching to assisted flushing avoided such
> > IPIs, and thus can result in pages belonging to the page tables being
> > removed (and possibly re-used) while __page_fault_type is being
> > executed.
> > 
> > Force some of the TLB flushes to use IPIs, thus avoiding the assisted
> > TLB flush. Those selected flushes are the page type change (when
> > switching from a page table type to a different one, ie: a page that
> > has been removed as a page table) and page allocation. This sadly has
> > a negative performance impact on the pvshim, as less assisted flushes
> > can be used.
> > 
> > Introduce a new flag (FLUSH_FORCE_IPI) and helper to force a TLB flush
> > using an IPI (flush_tlb_mask_sync). Note that the flag is only
> > meaningfully defined when the hypervisor supports PV mode, as
> > otherwise translated domains are in charge of their page tables and
> > won't share page tables with Xen, thus not influencing the result of
> > page walks performed by the spurious fault handler.
> 
> Is this true for shadow mode? If a page shadowing a guest one was
> given back quickly enough to the allocator and then re-used, I think
> the same situation could in principle arise.

Hm, I think it's not applicable to HVM shadow mode at least, because
CR3 is switched as part of vmentry/vmexit, and the page tables are not
shared between Xen and the guest, so there's no way for a HVM shadow
guest to modify the page-tables while Xen is walking them in
spurious_page_fault (note spurious_page_fault is only called when the
fault happens in non-guest context).

> > Note the flag is not defined on Arm, and the introduced helper is just
> > a dummy alias to the existing flush_tlb_mask.
> > 
> > Fixes: e9aca9470ed86 ('x86/tlb: use Xen L0 assisted TLB flush when 
> > available')
> > Reported-by: Andrew Cooper 
> > Signed-off-by: Roger Pau Monné 
> > ---
> > It's my understanding that not doing such IPI flushes could lead to
> > the pages tables being read by __page_fault_type being modified by a
> > third party, which could make them point to other mfns out of the
> > scope of the guest allowed physical memory addresses. However those
> > accesses would be limited to __page_fault_type, and hence the main
> > worry would be that a guest could make it point to read from a
> > physical memory region that has side effects?
> 
> I don't think so, no - the memory could be changed such that the
> PTEs are invalid altogether (like having reserved bits set). Consider
> for example the case of reading an MFN out of such a PTE that's larger
> than the physical address width supported by the CPU. Afaict
> map_domain_page() will happily install a respective page table entry,
> but we'd get a reserved-bit #PF upon reading from that mapping.

So there are no hazards from executing __page_fault_type against a
page-table that could be modified by a user?

> > ---
> >  xen/arch/x86/mm.c  | 12 +++-
> >  xen/common/memory.c|  2 +-
> >  xen/common/page_alloc.c|  2 +-
> >  xen/include/asm-arm/flushtlb.h |  1 +
> >  xen/include/asm-x86/flushtlb.h | 14 ++
> >  xen/include/xen/mm.h   |  8 ++--
> >  6 files changed, 34 insertions(+), 5 deletions(-)
> 
> Not finding a consumer of the new flag, my first reaction was to
> ask whether there's code missing somewhere. Having looked at
> flush_area_mask() another time I now understand the itended
> behavior results because of the extra flag now allowing
> hypervisor_flush_tlb() to be entered. I think that's something
> that's worth calling out in the description, or perhaps even in
> the comment next to the #define.

Oh right, the condition to use assisted flush is not actually changed
in flush_area_mask since setting any bit in the flags would prevent
using it.

> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -2894,7 +2894,17 @@ static int _get_page_type(struct page_info *page, 
> > unsigned long type,
> >((nx & PGT_type_mask) == PGT_writable_page)) )
> >  {
> >  perfc_incr(need_flush_tlb_flush);
> > -flush_tlb_mask(mask);
> > +if ( (x & PGT_type_mask) &&
> > + (x & PGT_type_mask) <= PGT_l4_page_table )
> 
> With there being 5-level page tables around the corner, I think
> we ought to get used to use PGT_root_page_table (or alike)
> whenever possible, to avoid having to touch such code when
> adding support for the new paging mode.
> 
> > --- a/xen/include/asm-x86/flushtlb.h
> > +++ b/xen/include/asm-x86/flu

[xen-unstable test] 151273: tolerable FAIL - PUSHED

2020-06-22 Thread osstest service owner
flight 151273 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/151273/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-rtds 18 guest-localmigrate/x10 fail in 151245 pass in 151273
 test-amd64-i386-xl-qemut-debianhvm-i386-xsm 10 debian-hvm-install fail in 151245 pass in 151273
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 16 guest-start/debianhvm.repeat fail pass in 151245

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt 14 saverestore-support-check fail like 151181
 test-armhf-armhf-xl-rtds 16 guest-start/debian.repeat fail like 151181
 test-amd64-amd64-xl-qemut-win7-amd64 17 guest-stop fail like 151181
 test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop fail like 151181
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop fail like 151181
 test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop fail like 151181
 test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop fail like 151181
 test-armhf-armhf-libvirt-raw 13 saverestore-support-check fail like 151181
 test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop fail like 151181
 test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop fail like 151181
 test-amd64-i386-xl-pvshim 12 guest-start fail never pass
 test-arm64-arm64-xl-seattle 13 migrate-support-check fail never pass
 test-arm64-arm64-xl-seattle 14 saverestore-support-check fail never pass
 test-amd64-i386-libvirt-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt 13 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 14 saverestore-support-check fail never pass
 test-arm64-arm64-xl 13 migrate-support-check fail never pass
 test-arm64-arm64-xl 14 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit2 13 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit2 14 saverestore-support-check fail never pass
 test-arm64-arm64-xl-xsm 13 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm 14 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit1 13 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit1 14 saverestore-support-check fail never pass
 test-arm64-arm64-xl-thunderx 13 migrate-support-check fail never pass
 test-arm64-arm64-xl-thunderx 14 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 14 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 12 migrate-support-check fail never pass
 test-armhf-armhf-xl 13 migrate-support-check fail never pass
 test-armhf-armhf-xl 14 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit2 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2 14 saverestore-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 14 saverestore-support-check fail never pass
 test-armhf-armhf-xl-rtds 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 14 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit1 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit1 14 saverestore-support-check fail never pass
 test-amd64-i386-xl-qemut-ws16-amd64 17 guest-stop fail never pass
 test-armhf-armhf-xl-cubietruck 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 14 saverestore-support-check fail never pass
 test-armhf-armhf-xl-vhd 12 migrate-support-check fail never pass
 test-armhf-armhf-xl-vhd 13 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt-raw 12 migrate-support-check fail never pass
 test-amd64-amd64-qemuu-nested-amd 17 debian-hvm-install/l1/l2 fail never pass

version targeted for testing:
 xen  fde76f895d0aa817a1207d844d793239c6639bc6
baseline version:
 xen  3625b04991b4d6affadd99d377ab84bac48dfff4

Last test of basis   151181  2020-06-17 06:01:52 Z5 days
Failing since 151224  2020-06-18 16:01:07 Z

Re: [PATCH v3 1/1] x86/acpi: Use FADT flags to determine the PMTMR width

2020-06-22 Thread Jan Beulich
On 22.06.2020 10:54, Roger Pau Monné wrote:
> On Fri, Jun 19, 2020 at 10:00:16PM +, Grzegorz Uriasz wrote:
>> @@ -490,8 +497,12 @@ static int __init acpi_parse_fadt(struct 
>> acpi_table_header *table)
>>   */
>>  if (!pmtmr_ioport) {
>>  pmtmr_ioport = fadt->pm_timer_block;
>> -pmtmr_width = fadt->pm_timer_length == 4 ? 24 : 0;
>> +pmtmr_width = fadt->pm_timer_length == 4 ? 32 : 0;
>>  }
>> +if (pmtmr_width < 32 && fadt->flags & ACPI_FADT_32BIT_TIMER)
>> +printk(KERN_WARNING PREFIX "PM-Timer is too short\n");
> 
> Indentation. Also this should be a fatal error likely? You should set
> pmtmr_ioport = 0?

Not sure, to be honest. It's entirely fuzzy what mode to use in this
case, yet assuming a working 24-bit timer looks valid.
And indeed we'd use the timer (with a 24-bit mask) exactly when
pmtmr_width == 24 (alongside the bogus set ACPI_FADT_32BIT_TIMER bit).

What I do notice only now though is that inside the if() there's a
pair of parentheses missing.

Jan



Re: Event delivery and "domain blocking" on PVHv2

2020-06-22 Thread Martin Lucina

On 2020-06-19 19:42, Roger Pau Monné wrote:

On Fri, Jun 19, 2020 at 06:54:26PM +0200, Roger Pau Monné wrote:

On Fri, Jun 19, 2020 at 06:41:21PM +0200, Martin Lucina wrote:
> On 2020-06-19 13:21, Roger Pau Monné wrote:
> > On Fri, Jun 19, 2020 at 12:28:50PM +0200, Martin Lucina wrote:
> > > On 2020-06-18 13:46, Roger Pau Monné wrote:
> > > > On Thu, Jun 18, 2020 at 12:13:30PM +0200, Martin Lucina wrote:
> > > > > At this point I don't really have a clear idea of how to progress,
> > > > > comparing my implementation side-by-side with the original PV
> > > > > Mini-OS-based
> > > > > implementation doesn't show up any differences I can see.
> > > > >
> > > > > AFAICT the OCaml code I've also not changed in any material way, and
> > > > > that
> > > > > has been running in production on PV for years, so I'd be inclined
> > > > > to think
> > > > > the problem is in my reimplementation of the C parts, but where...?
> > > >
> > > > A good start would be to print the ISR and IRR lapic registers when
> > > > blocked, to assert there are no pending vectors there.
> > > >
> > > > Can you apply the following patch to your Xen, rebuild and check the
> > > > output of the 'l' debug key?
> > > >
> > > > Also add the output of the 'v' key.
> > >
> > > Had to fight the Xen Debian packages a bit as I wanted to patch the
> > > exact
> > > same Xen (there are some failures when building on a system that has
> > > Xen
> > > installed due to following symlinks when fixing shebangs).
> > >
> > > Here you go, when stuck during netfront setup, after allocating its
> > > event
> > > channel, presumably waiting on Xenstore:
> > >
> > > 'e':
> > >
> > > (XEN) Event channel information for domain 3:
> > > (XEN) Polling vCPUs: {}
> > > (XEN) port [p/m/s]
> > > (XEN)1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> > > (XEN)2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> > > (XEN)3 [1/0/1]: s=5 n=0 x=0 v=0
> > > (XEN)4 [0/1/1]: s=2 n=0 x=0 d=0
> > >
> > > 'l':
> > >
> > > (XEN) d3v0 IRR:
> > > 8301732dc200b
> > > (XEN) d3v0 ISR:
> > > 8301732dc100b
> >
> > Which version of Xen is this? AFAICT it doesn't have the support to
> > print a bitmap.
>
> That in Debian 10 (stable):
>
> ii  xen-hypervisor-4.11-amd64  4.11.3+24-g14b62ab3e5-1~deb10u1.2
> amd64  Xen Hypervisor on AMD64
>
> xen_major  : 4
> xen_minor  : 11
> xen_extra  : .4-pre
> xen_version: 4.11.4-pre
>
> >
> > Do you think you could also pick commit
> > 8cd9500958d818e3deabdd0d4164ea6fe1623d7c [0] and rebuild? (and print
> > the info again).
>
> Done, here you go:
>
> (XEN) Event channel information for domain 3:
> (XEN) Polling vCPUs: {}
> (XEN) port [p/m/s]
> (XEN)1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> (XEN)2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> (XEN)3 [1/0/1]: s=5 n=0 x=0 v=0
> (XEN)4 [0/1/1]: s=3 n=0 x=0 d=0 p=35
>
>
> (XEN) d3v0 IRR:
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
> (XEN) d3v0 ISR:
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000

So there's nothing pending on the lapic. Can you assert that you will
always execute evtchn_demux_pending after you have received an event
channel interrupt (ie: executed solo5__xen_evtchn_vector_handler)?

I think this would be simpler if you moved evtchn_demux_pending into
solo5__xen_evtchn_vector_handler? As there would be less asynchronous
processing, and thus likely less races?


Having thought about this, I think this model of not demuxing in
solo5__xen_evtchn_vector_handler is always racy, as it's not possible
to assert that you would always call evtchn_demux_pending after
solo5__xen_evtchn_vector_handler?

Ie: if you receive an interrupt just before going to sleep (after the
sti and before the hlt) you will execute
solo5__xen_evtchn_vector_handler and EOI the vector, but then
evtchn_demux_pending will never get called, and thus the interrupts
will stay indefinitely pending?


Aha! Thank you for pointing this out. I think you may be right, but this 
should be possible without doing the demuxing in interrupt context.


How about this arrangement, which appears to work for me; no hangs I can 
see so far and domU survives ping -f fine with no packet loss:


CAMLprim value
mirage_xen_evtchn_block_domain(value v_deadline)
{
struct vcpu_info *vi = VCPU0_INFO();
solo5_time_t deadline = Int64_val(v_deadline);

if (solo5_clock_monotonic() < deadline) {
__asm__ __volatile__ ("cli" : : : "memory");
if (vi->evtchn_upcall_pending) {
__asm__ __volatile__ ("sti");
}
else {
hypercall_set_timer_op(deadline);
__asm__ __volatile__ ("sti; hlt");
}
}
return Val_unit;
}

i.e. always go to sleep with interrupts disabled, but before doing so 
re-check that no events have become pending since the last time 
evtchn_demux_pending() was called. This holds, since the only thing that 
sets vi->evtchn_upcall_pending

Re: [PATCH for-4.14] x86/tlb: fix assisted flush usage

2020-06-22 Thread Jan Beulich
On 22.06.2020 11:31, Roger Pau Monné wrote:
> On Fri, Jun 19, 2020 at 04:06:55PM +0200, Jan Beulich wrote:
>> On 18.06.2020 18:04, Roger Pau Monne wrote:
>>> Commit e9aca9470ed86 introduced a regression when avoiding sending
>>> IPIs for certain flush operations. Xen page fault handler
>>> (spurious_page_fault) relies on blocking interrupts in order to
>>> prevent handling TLB flush IPIs and thus preventing other CPUs from
>>> removing page tables pages. Switching to assisted flushing avoided such
>>> IPIs, and thus can result in pages belonging to the page tables being
>>> removed (and possibly re-used) while __page_fault_type is being
>>> executed.
>>>
>>> Force some of the TLB flushes to use IPIs, thus avoiding the assisted
>>> TLB flush. Those selected flushes are the page type change (when
>>> switching from a page table type to a different one, ie: a page that
>>> has been removed as a page table) and page allocation. This sadly has
>>> a negative performance impact on the pvshim, as less assisted flushes
>>> can be used.
>>>
>>> Introduce a new flag (FLUSH_FORCE_IPI) and helper to force a TLB flush
>>> using an IPI (flush_tlb_mask_sync). Note that the flag is only
>>> meaningfully defined when the hypervisor supports PV mode, as
>>> otherwise translated domains are in charge of their page tables and
>>> won't share page tables with Xen, thus not influencing the result of
>>> page walks performed by the spurious fault handler.
>>
>> Is this true for shadow mode? If a page shadowing a guest one was
>> given back quickly enough to the allocator and then re-used, I think
>> the same situation could in principle arise.
> 
> Hm, I think it's not applicable to HVM shadow mode at least, because
> CR3 is switched as part of vmentry/vmexit, and the page tables are not
> shared between Xen and the guest, so there's no way for a HVM shadow
> guest to modify the page-tables while Xen is walking them in
> spurious_page_fault (note spurious_page_fault is only called when the
> fault happens in non-guest context).

I'm afraid I disagree, because of shadow's use of "linear page tables".

>>> Note the flag is not defined on Arm, and the introduced helper is just
>>> a dummy alias to the existing flush_tlb_mask.
>>>
>>> Fixes: e9aca9470ed86 ('x86/tlb: use Xen L0 assisted TLB flush when 
>>> available')
>>> Reported-by: Andrew Cooper 
>>> Signed-off-by: Roger Pau Monné 
>>> ---
>>> It's my understanding that not doing such IPI flushes could lead to
>>> the pages tables being read by __page_fault_type being modified by a
>>> third party, which could make them point to other mfns out of the
>>> scope of the guest allowed physical memory addresses. However those
>>> accesses would be limited to __page_fault_type, and hence the main
>>> worry would be that a guest could make it point to read from a
>>> physical memory region that has side effects?
>>
>> I don't think so, no - the memory could be changed such that the
>> PTEs are invalid altogether (like having reserved bits set). Consider
>> for example the case of reading an MFN out of such a PTE that's larger
>> than the physical address width supported by the CPU. Afaict
>> map_domain_page() will happily install a respective page table entry,
>> but we'd get a reserved-bit #PF upon reading from that mapping.
> 
> So there are no hazards from executing __page_fault_type against a
> page-table that could be modified by a user?

There are - I realize only now that the way I started my earlier
reply was ambiguous. I meant to negate your "only with side effects"
way of thinking.

Jan



Re: [RFC XEN PATCH 19/23] riscv: Add the lib directory

2020-06-22 Thread Jan Beulich
On 22.01.2020 02:58, Bobby Eshleman wrote:
> From: Alistair Francis 
> 
> Add the lib directory with find_next_bit support.
> 
> This was taken from Linux

As was Arm64's - the two definitely would want folding.

Jan



Re: [RFC XEN PATCH 22/23] riscv: Add sysctl.c

2020-06-22 Thread Jan Beulich
On 22.01.2020 02:59, Bobby Eshleman wrote:
> --- /dev/null
> +++ b/xen/arch/riscv/sysctl.c
> @@ -0,0 +1,31 @@
> +/**
> + * Arch-specific sysctl.c
> + *
> + * System management operations. For use by node control stack.
> + *
> + * Copyright (c) 2012, Citrix Systems
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +void arch_do_physinfo(struct xen_sysctl_physinfo *pi) { }
> +
> +long arch_do_sysctl(struct xen_sysctl *sysctl,
> +XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
> +{
> +return -ENOSYS;

At the example of this (there may be more in this series) - -EOPNOTSUPP
please. Only top level hypercall handlers ought to produce -ENOSYS, for
major hypercall numbers with no handler.

Jan



[PATCH] x86/CPUID: fill all fields in x86_cpuid_policy_fill_native()

2020-06-22 Thread Jan Beulich
Coverity validly complains that the new call from
tools/tests/cpu-policy/test-cpu-policy.c:test_cpuid_current() leaves
two fields uninitialized, yet they get then consumed by
x86_cpuid_copy_to_buffer(). (All other present callers of the function
pass a pointer to a static - and hence initialized - buffer.)

Coverity-ID: 1464809
Fixes: c22ced93e167 ("tests/cpu-policy: Confirm that CPUID serialisation is 
sorted")
Signed-off-by: Jan Beulich 

--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -176,6 +176,10 @@ void x86_cpuid_policy_fill_native(struct
   ARRAY_SIZE(p->extd.raw) - 1); ++i )
 cpuid_leaf(0x80000000 + i, &p->extd.raw[i]);
 
+/* Don't report leaves from possible lower level hypervisor. */
+p->hv_limit = 0;
+p->hv2_limit = 0;
+
 x86_cpuid_policy_recalc_synth(p);
 }
 



Re: [PATCH v2 2/7] x86/vmx: add Intel PT MSR definitions

2020-06-22 Thread Jan Beulich
On 19.06.2020 01:39, Michał Leszczyński wrote:
> --- a/xen/include/asm-x86/msr-index.h
> +++ b/xen/include/asm-x86/msr-index.h
> @@ -621,4 +621,41 @@
>  #define MSR_PKGC9_IRTL   0x0634
>  #define MSR_PKGC10_IRTL  0x0635
>  
> +/* Intel PT MSRs */
> +#define MSR_RTIT_OUTPUT_BASE   0x0560
> +#define MSR_RTIT_OUTPUT_MASK   0x0561
> +#define MSR_RTIT_CTL   0x0570
> +#define RTIT_CTL_TRACEEN   (_AC(1, ULL) << 0)
> +#define RTIT_CTL_CYCEN (_AC(1, ULL) << 1)
> +#define RTIT_CTL_OS(_AC(1, ULL) << 2)
> +#define RTIT_CTL_USR   (_AC(1, ULL) << 3)
> +#define RTIT_CTL_PWR_EVT_EN(_AC(1, ULL) << 4)
> +#define RTIT_CTL_FUP_ON_PTW(_AC(1, ULL) << 5)
> +#define RTIT_CTL_FABRIC_EN (_AC(1, ULL) << 6)
> +#define RTIT_CTL_CR3_FILTER(_AC(1, ULL) << 7)
> +#define RTIT_CTL_TOPA  (_AC(1, ULL) << 8)
> +#define RTIT_CTL_MTC_EN(_AC(1, ULL) << 9)
> +#define RTIT_CTL_TSC_EN(_AC(1, ULL) << 10)
> +#define RTIT_CTL_DIS_RETC  (_AC(1, ULL) << 11)
> +#define RTIT_CTL_PTW_EN(_AC(1, ULL) << 12)
> +#define RTIT_CTL_BRANCH_EN (_AC(1, ULL) << 13)
> +#define RTIT_CTL_MTC_FREQ_OFFSET   14
> +#define RTIT_CTL_MTC_FREQ  (0x0fULL << RTIT_CTL_MTC_FREQ_OFFSET)

This was a fair step in the right direction, but you've missed
some instances (like here) wanting to use _AC(), and you've
also not moved the whole block up above the legacy line. As
Andrew's "x86/msr: Disallow access to Processor Trace MSRs" is
likely to go in before 4.14 in some form, you'll want to
re-base over it eventually anyway. You may want to take a look
there right away, to see where in the header to place your
addition.

If you look further up in the file you'll also notice how we
try to visually separate MSR numbers from bit-within-MSR
definitions.

Finally I'd also like to ask that you omit all RTIT_CTL_*_OFFSET
values. Only the _OFFSET-less #define-s should really be needed
- see MASK_EXTR() and MASK_INSR() in case right now you're using
these for some open-coded shifting ...

Jan



Re: [PATCH] x86/CPUID: fill all fields in x86_cpuid_policy_fill_native()

2020-06-22 Thread Andrew Cooper
On 22/06/2020 13:09, Jan Beulich wrote:
> Coverity validly complains that the new call from
> tools/tests/cpu-policy/test-cpu-policy.c:test_cpuid_current() leaves
> two fields uninitialized, yet they get then consumed by
> x86_cpuid_copy_to_buffer(). (All other present callers of the function
> pass a pointer to a static - and hence initialized - buffer.)
>
> Coverity-ID: 1464809
> Fixes: c22ced93e167 ("tests/cpu-policy: Confirm that CPUID serialisation is 
> sorted")
> Signed-off-by: Jan Beulich 
>
> --- a/xen/lib/x86/cpuid.c
> +++ b/xen/lib/x86/cpuid.c
> @@ -176,6 +176,10 @@ void x86_cpuid_policy_fill_native(struct
>ARRAY_SIZE(p->extd.raw) - 1); ++i )
>  cpuid_leaf(0x80000000 + i, &p->extd.raw[i]);
>  
> +/* Don't report leaves from possible lower level hypervisor. */

", for now."

This will change in the future.

With this, Reviewed-by: Andrew Cooper 

> +p->hv_limit = 0;
> +p->hv2_limit = 0;
> +
>  x86_cpuid_policy_recalc_synth(p);
>  }
>  




Re: [PATCH v2 3/7] x86/vmx: add IPT cpu feature

2020-06-22 Thread Jan Beulich
On 19.06.2020 01:40, Michał Leszczyński wrote:
> @@ -630,6 +633,12 @@ static inline bool hvm_altp2m_supported(void)
>  return hvm_funcs.altp2m_supported;
>  }
>  
> +/* returns true if hardware supports Intel Processor Trace */
> +static inline bool hvm_pt_supported(void)
> +{
> +return hvm_funcs.pt_supported;
> +}

This is vendor-independent code, hence I'd like to see "Intel"
dropped from the comment.

> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> @@ -285,6 +285,7 @@ extern u64 vmx_ept_vpid_cap;
>  
>  #define VMX_MISC_CR3_TARGET 0x01ff0000
>  #define VMX_MISC_VMWRITE_ALL0x20000000
> +#define VMX_MISC_PT_SUPPORTED   0x00004000

Move this up by two lines, so the values continue to be numerically
sorted?

Jan



[xen-4.12-testing test] 151276: regressions - FAIL

2020-06-22 Thread osstest service owner
flight 151276 xen-4.12-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/151276/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-i386-prev   6 xen-buildfail REGR. vs. 150176
 build-amd64-prev  6 xen-buildfail REGR. vs. 150176

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-migrupgrade  1 build-check(1)   blocked  n/a
 test-amd64-i386-migrupgrade   1 build-check(1)   blocked  n/a
 test-amd64-amd64-xl-qcow217 guest-localmigrate/x10   fail  like 150176
 test-amd64-i386-xl-pvshim12 guest-start  fail   never pass
 test-amd64-amd64-libvirt-xsm 13 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 13 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  13 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  14 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check 
fail never pass
 test-arm64-arm64-xl-credit2  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 13 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 14 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check 
fail never pass
 test-arm64-arm64-xl-thunderx 13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 14 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 12 migrate-support-checkfail   never pass
 test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop  fail never pass
 test-armhf-armhf-xl-credit2  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 13 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt 14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 13 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 14 saverestore-support-checkfail never pass
 test-armhf-armhf-xl-multivcpu 13 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 14 saverestore-support-checkfail  never pass
 test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop fail never pass
 test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop fail never pass
 test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop  fail never pass
 test-amd64-amd64-xl-qemut-win7-amd64 17 guest-stop fail never pass
 test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop  fail never pass
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop fail never pass
 test-armhf-armhf-xl  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 12 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 13 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  12 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  13 saverestore-support-checkfail   never pass
 test-amd64-i386-xl-qemut-ws16-amd64 17 guest-stop  fail never pass
 test-amd64-amd64-qemuu-nested-amd 17 debian-hvm-install/l1/l2  fail never pass

version targeted for testing:
 xen  06760c2bf322cef4622b56bafee74fe93ae01630
baseline version:
 xen  09b61126b4d1e9d372fd2e24b702be10a358da9d

Last test of basis   150176  2020-05-14 12:36:34 Z   39 days
Failing s

Re: [PATCH for-4.14] x86/tlb: fix assisted flush usage

2020-06-22 Thread Roger Pau Monné
On Mon, Jun 22, 2020 at 01:07:10PM +0200, Jan Beulich wrote:
> On 22.06.2020 11:31, Roger Pau Monné wrote:
> > On Fri, Jun 19, 2020 at 04:06:55PM +0200, Jan Beulich wrote:
> >> On 18.06.2020 18:04, Roger Pau Monne wrote:
> >>> Commit e9aca9470ed86 introduced a regression when avoiding sending
> >>> IPIs for certain flush operations. Xen page fault handler
> >>> (spurious_page_fault) relies on blocking interrupts in order to
> >>> prevent handling TLB flush IPIs and thus preventing other CPUs from
> >>> removing page tables pages. Switching to assisted flushing avoided such
> >>> IPIs, and thus can result in pages belonging to the page tables being
> >>> removed (and possibly re-used) while __page_fault_type is being
> >>> executed.
> >>>
> >>> Force some of the TLB flushes to use IPIs, thus avoiding the assisted
> >>> TLB flush. Those selected flushes are the page type change (when
> >>> switching from a page table type to a different one, ie: a page that
> >>> has been removed as a page table) and page allocation. This sadly has
> >>> a negative performance impact on the pvshim, as less assisted flushes
> >>> can be used.
> >>>
> >>> Introduce a new flag (FLUSH_FORCE_IPI) and helper to force a TLB flush
> >>> using an IPI (flush_tlb_mask_sync). Note that the flag is only
> >>> meaningfully defined when the hypervisor supports PV mode, as
> >>> otherwise translated domains are in charge of their page tables and
> >>> won't share page tables with Xen, thus not influencing the result of
> >>> page walks performed by the spurious fault handler.
> >>
> >> Is this true for shadow mode? If a page shadowing a guest one was
> >> given back quickly enough to the allocator and then re-used, I think
> >> the same situation could in principle arise.
> > 
> > Hm, I think it's not applicable to HVM shadow mode at least, because
> > CR3 is switched as part of vmentry/vmexit, and the page tables are not
> > shared between Xen and the guest, so there's no way for a HVM shadow
> > guest to modify the page-tables while Xen is walking them in
> > spurious_page_fault (note spurious_page_fault is only called when the
> > fault happens in non-guest context).
> 
> I'm afraid I disagree, because of shadow's use of "linear page tables".

You will have to bear with me, but I don't follow.

Could you provide some pointers at how/where the shadow (I assume
guest controlled) "linear page tables" are used while in Xen
context?

do_page_fault will only call spurious_page_fault (and thus attempt a
page walk) when the fault happens in non-guest context, yet on HVM all
accesses to guest memory are performed using __hvm_copy which doesn't
load any guest page tables into CR3, but instead performs a software
walk of the paging structures in order to find and map the destination
address into the hypervisor page tables and perform the read or copy.

> >>> Note the flag is not defined on Arm, and the introduced helper is just
> >>> a dummy alias to the existing flush_tlb_mask.
> >>>
> >>> Fixes: e9aca9470ed86 ('x86/tlb: use Xen L0 assisted TLB flush when 
> >>> available')
> >>> Reported-by: Andrew Cooper 
> >>> Signed-off-by: Roger Pau Monné 
> >>> ---
> >>> It's my understanding that not doing such IPI flushes could lead to
> >>> the pages tables being read by __page_fault_type being modified by a
> >>> third party, which could make them point to other mfns out of the
> >>> scope of the guest allowed physical memory addresses. However those
> >>> accesses would be limited to __page_fault_type, and hence the main
> >>> worry would be that a guest could make it point to read from a
> >>> physical memory region that has side effects?
> >>
> >> I don't think so, no - the memory could be changed such that the
> >> PTEs are invalid altogether (like having reserved bits set). Consider
> >> for example the case of reading an MFN out of such a PTE that's larger
> >> than the physical address width supported by the CPU. Afaict
> >> map_domain_page() will happily install a respective page table entry,
> >> but we'd get a reserved-bit #PF upon reading from that mapping.
> > 
> > So there are no hazards from executing __page_fault_type against a
> > page-table that could be modified by a user?
> 
> There are - I realize only now that the way I started my earlier
> reply was ambiguous. I meant to negate your "only with side effects"
> way of thinking.

Ack.

Thanks, Roger.



Re: [PATCH v2 4/7] x86/vmx: add do_vmtrace_op

2020-06-22 Thread Jan Beulich
On 19.06.2020 01:41, Michał Leszczyński wrote:
> @@ -1631,6 +1649,8 @@ void hvm_vcpu_destroy(struct vcpu *v)
>  vlapic_destroy(v);
>  
>  hvm_vcpu_cacheattr_destroy(v);
> +
> +hvm_vmtrace_destroy(v);
>  }

Whenever possible resource cleanup should occur from
hvm_domain_relinquish_resources(). Is there anything preventing
this here?

> @@ -4066,6 +4086,51 @@ static int hvmop_set_evtchn_upcall_vector(
>  return 0;
>  }
>  
> +static int hvm_set_vmtrace_pt_size(struct domain *d, uint64_t value)
> +{
> +void *buf;
> +unsigned int buf_order;
> +struct page_info *pg;
> +struct ipt_state *ipt;
> +struct vcpu *v;
> +
> +if ( value < PAGE_SIZE ||
> + value > GB(4) ||
> + ( value & (value - 1) ) ) {
> +/* we don't accept trace buffer size smaller than single page
> + * and the upper bound is defined as 4GB in the specification */
> +return -EINVAL;
> +}
> +
> +for_each_vcpu ( d, v )
> +{
> +buf_order = get_order_from_bytes(value);
> +pg = alloc_domheap_pages(d, buf_order, MEMF_no_refcount);
> +
> +if ( !pg )
> +return -EFAULT;
> +
> +buf = page_to_virt(pg);

In addition to what Roger has said here, just fyi that you can't
use page_to_virt() on anything returned from alloc_domheap_pages(),
unless you suitably restrict the address range such that the
result is guaranteed to have a direct mapping.

> @@ -4949,6 +5018,100 @@ static int compat_altp2m_op(
>  return rc;
>  }
>  
> +static int do_vmtrace_op(XEN_GUEST_HANDLE_PARAM(void) arg)
> +{
> +struct xen_hvm_vmtrace_op a;
> +struct domain *d;
> +int rc;
> +struct vcpu *v;
> +struct ipt_state *ipt;
> +
> +if ( !hvm_pt_supported() )
> +return -EOPNOTSUPP;
> +
> +if ( copy_from_guest(&a, arg, 1) )
> +return -EFAULT;
> +
> +if ( a.version != HVMOP_VMTRACE_INTERFACE_VERSION )
> +return -EINVAL;
> +
> +d = rcu_lock_domain_by_any_id(a.domain);
> +spin_lock(&d->vmtrace_lock);
> +
> +if ( d == NULL )
> +return -ESRCH;
> +
> +if ( !is_hvm_domain(d) )
> +{
> +rc = -EOPNOTSUPP;
> +goto out;
> +}
> +
> +domain_pause(d);

Who's the intended caller of this interface? You making it a hvm-op
suggests the guest may itself call this. But of course a guest
can't pause itself. If this is supposed to be a tools-only interface,
then you should frame it suitably in the public header and of course
you need to enforce this here (which would e.g. mean you shouldn't
use rcu_lock_domain_by_any_id()).

Also please take a look at hvm/ioreq.c, which makes quite a bit of
use of domain_pause(). In particular I think you want to acquire
the lock only after having paused the domain.

> +if ( a.vcpu >= d->max_vcpus )
> +{
> +rc = -EINVAL;
> +goto out;
> +}
> +
> +v = d->vcpu[a.vcpu];
> +ipt = v->arch.hvm.vmx.ipt_state;
> +
> +if ( !ipt )
> +{
> +/*
> +  * PT must be first initialized upon domain creation.
> +  */
> +rc = -EINVAL;
> +goto out;
> +}
> +
> +switch ( a.cmd )
> +{
> +case HVMOP_vmtrace_ipt_enable:
> +if ( vmx_add_guest_msr(v, MSR_RTIT_CTL,
> +   RTIT_CTL_TRACEEN | RTIT_CTL_OS |
> +   RTIT_CTL_USR | RTIT_CTL_BRANCH_EN) )
> +{
> +rc = -EFAULT;
> +goto out;
> +}
> +
> +ipt->active = 1;
> +break;
> +case HVMOP_vmtrace_ipt_disable:
> +if ( vmx_add_guest_msr(v, MSR_RTIT_CTL, 0) )

Shouldn't you rather remove the MSR from the load list here?

Is any of what you do in this switch() actually legitimate without
hvm_set_vmtrace_pt_size() having got called for the guest? From
remarks elsewhere I imply you expect the param that you currently
use to be set upon domain creation time, but at the very least the
potentially big buffer should imo not get allocated up front, but
only when tracing is to actually be enabled.

Also please have blank lines between individual case blocks.

> +{
> +rc = -EFAULT;
> +goto out;
> +}
> +
> +ipt->active = 0;
> +break;
> +case HVMOP_vmtrace_ipt_get_offset:
> +a.offset = ipt->output_mask.offset;
> +break;
> +default:
> +rc = -EOPNOTSUPP;
> +goto out;
> +}
> +
> +rc = -EFAULT;
> +if ( __copy_to_guest(arg, &a, 1) )
> +  goto out;

Only HVMOP_vmtrace_ipt_get_offset needs this - please avoid copying
back on other paths. Even there you could in principle copy back
just the one field; the function taking XEN_GUEST_HANDLE_PARAM(void)
gets in the way of this, though.

The last line above also has an indentation issue, but the use of
"goto" in this case can be avoided anyway.

> +rc = 0;
> +
> + out:
> +domain_unpause(d);
> +spin_unlock(&d->vmtrace_lock);
> +rcu_unlock_domain(d);
> +
> +return rc;

Re: [PATCH] x86/CPUID: fill all fields in x86_cpuid_policy_fill_native()

2020-06-22 Thread Jan Beulich
On 22.06.2020 14:39, Andrew Cooper wrote:
> On 22/06/2020 13:09, Jan Beulich wrote:
>> Coverity validly complains that the new call from
>> tools/tests/cpu-policy/test-cpu-policy.c:test_cpuid_current() leaves
>> two fields uninitialized, yet they get then consumed by
>> x86_cpuid_copy_to_buffer(). (All other present callers of the function
>> pass a pointer to a static - and hence initialized - buffer.)
>>
>> Coverity-ID: 1464809
>> Fixes: c22ced93e167 ("tests/cpu-policy: Confirm that CPUID serialisation is 
>> sorted")
>> Signed-off-by: Jan Beulich 
>>
>> --- a/xen/lib/x86/cpuid.c
>> +++ b/xen/lib/x86/cpuid.c
>> @@ -176,6 +176,10 @@ void x86_cpuid_policy_fill_native(struct
>>ARRAY_SIZE(p->extd.raw) - 1); ++i )
>>  cpuid_leaf(0x80000000 + i, &p->extd.raw[i]);
>>  
>> +/* Don't report leaves from possible lower level hypervisor. */
> 
> ", for now."
> 
> This will change in the future.

I was pondering at that moment whether to add it, but then I didn't
think we'd want to let shine through lower level hypervisor info.
But yes, I've added it, because it won't be wrong to say "for now",
even if it end up being for much longer.

> With this, Reviewed-by: Andrew Cooper 

Thanks, Jan



Re: [PATCH for-4.14] x86/tlb: fix assisted flush usage

2020-06-22 Thread Jan Beulich
On 22.06.2020 15:24, Roger Pau Monné wrote:
> On Mon, Jun 22, 2020 at 01:07:10PM +0200, Jan Beulich wrote:
>> On 22.06.2020 11:31, Roger Pau Monné wrote:
>>> On Fri, Jun 19, 2020 at 04:06:55PM +0200, Jan Beulich wrote:
 On 18.06.2020 18:04, Roger Pau Monne wrote:
> Commit e9aca9470ed86 introduced a regression when avoiding sending
> IPIs for certain flush operations. Xen page fault handler
> (spurious_page_fault) relies on blocking interrupts in order to
> prevent handling TLB flush IPIs and thus preventing other CPUs from
> removing page tables pages. Switching to assisted flushing avoided such
> IPIs, and thus can result in pages belonging to the page tables being
> removed (and possibly re-used) while __page_fault_type is being
> executed.
>
> Force some of the TLB flushes to use IPIs, thus avoiding the assisted
> TLB flush. Those selected flushes are the page type change (when
> switching from a page table type to a different one, ie: a page that
> has been removed as a page table) and page allocation. This sadly has
> a negative performance impact on the pvshim, as less assisted flushes
> can be used.
>
> Introduce a new flag (FLUSH_FORCE_IPI) and helper to force a TLB flush
> using an IPI (flush_tlb_mask_sync). Note that the flag is only
> meaningfully defined when the hypervisor supports PV mode, as
> otherwise translated domains are in charge of their page tables and
> won't share page tables with Xen, thus not influencing the result of
> page walks performed by the spurious fault handler.

 Is this true for shadow mode? If a page shadowing a guest one was
 given back quickly enough to the allocator and then re-used, I think
 the same situation could in principle arise.
>>>
>>> Hm, I think it's not applicable to HVM shadow mode at least, because
>>> CR3 is switched as part of vmentry/vmexit, and the page tables are not
>>> shared between Xen and the guest, so there's no way for a HVM shadow
>>> guest to modify the page-tables while Xen is walking them in
>>> spurious_page_fault (note spurious_page_fault is only called when the
>>> fault happens in non-guest context).
>>
>> I'm afraid I disagree, because of shadow's use of "linear page tables".
> 
> You will have to bear with me, but I don't follow.
> 
> Could you provide some pointers at how/where the shadow (I assume
> guest controlled) "linear page tables" are used while in Xen
> context?

See config.h:

/* Slot 258: linear page table (guest table). */
#define LINEAR_PT_VIRT_START(PML4_ADDR(258))
#define LINEAR_PT_VIRT_END  (LINEAR_PT_VIRT_START + PML4_ENTRY_BYTES)
/* Slot 259: linear page table (shadow table). */
#define SH_LINEAR_PT_VIRT_START (PML4_ADDR(259))
#define SH_LINEAR_PT_VIRT_END   (SH_LINEAR_PT_VIRT_START + PML4_ENTRY_BYTES)

These linear page tables exist in the Xen page tables at basically
all times as long as a shadow guest's vCPU is in context. They're
there to limit the overhead of accessing guest page tables and
their shadows from inside Xen.

Jan



Re: UEFI support in ARM DomUs

2020-06-22 Thread Oleksandr Andrushchenko

On 6/19/20 4:29 PM, Oleksandr Andrushchenko wrote:
> On 6/19/20 4:15 PM, Julien Grall wrote:
>>
>>
>> On 19/06/2020 14:06, Oleksandr Andrushchenko wrote:
>>>
>>> On 6/19/20 3:59 PM, Julien Grall wrote:
 Hi,

 On 19/06/2020 13:51, Oleksandr Andrushchenko wrote:
> On 6/19/20 3:47 PM, Julien Grall wrote:
>> They will not be available from the fdt, but you can retrieve them with
>> a hypervisor call (see HVM_PARAM_STORE_PFN, HVM_PARAM_CONSOLE_PFN).
> Yes, and it is used in the relevant pieces of code (hyp calls)
>> One question though, why do you need to map them in advance? Couldn't 
>> you map them on demand?
>
> Well, we need to at least estimate the pg_table size so we can reserve 
> and allocate memory later,

 Oh, so U-boot doesn't support runtime page-table allocation. Is that 
 right?
>>> As per my understanding no, we provide a memory map and the tables are 
>>> allocated beforehand
>>
>> Ok :(.
>>

>
> so I have to provide the memory range either by coding a constant or 
> looking into the devtree at
>
> hypervisor { reg = <>; }. It is a bit tricky though

 Looking for a node in the device-tree shouldn't be too difficult given 
 that you have fdt_* available.

 However, please note that  doesn't refer to the guest magic pages. 
 Instead, it provides a region you can use for mapping the grant-table 
 frames.
>>>
>>> Indeed, this is in my case 0x3800, but the magic is at 0x3900
>>>
>>> So, I need the memory range set up beforehand, but I can't as there is no 
>>> cute way to get that.
>>>
>>> Of course, I can issue a hyp call to get HVM_PARAM_CONSOLE_PFN and use it 
>>> as the base address,
>>>
>>> but this smells like a hack. I can call other HVM_PARAM_ to get their pfns 
>>> and set up the memory regions,
>>>
>>> but this looks a bit weird.
>>
>> Why is it weird? In general, you only want to map exactly what you need. Not 
>> less, not more.
>>
>> In your situation, you only care about two pages, the console page and the 
>> xenstore page. The rest shouldn't be mapped.
> Ok, so I'll try get pfns for HVM_PARAM_CONSOLE_PFN + XENSTORE_PFN_OFFSET via 
> hyp call and map those
>>
>>> I need that constant badly ;)
>>
>> No you don't ;).

We have managed to make use of the relevant hypercalls to get the PFNs, but
for that we need to maintain the caches, as these calls happen while the MMU
is not yet set up and is still being configured.

>>
>> Cheers,
>>
> Thanks for helping with this

Re: [PATCH] x86/CPUID: fill all fields in x86_cpuid_policy_fill_native()

2020-06-22 Thread Andrew Cooper
On 22/06/2020 14:37, Jan Beulich wrote:
> On 22.06.2020 14:39, Andrew Cooper wrote:
>> On 22/06/2020 13:09, Jan Beulich wrote:
>>> Coverity validly complains that the new call from
>>> tools/tests/cpu-policy/test-cpu-policy.c:test_cpuid_current() leaves
>>> two fields uninitialized, yet they get then consumed by
>>> x86_cpuid_copy_to_buffer(). (All other present callers of the function
>>> pass a pointer to a static - and hence initialized - buffer.)
>>>
>>> Coverity-ID: 1464809
>>> Fixes: c22ced93e167 ("tests/cpu-policy: Confirm that CPUID serialisation is 
>>> sorted")
>>> Signed-off-by: Jan Beulich 
>>>
>>> --- a/xen/lib/x86/cpuid.c
>>> +++ b/xen/lib/x86/cpuid.c
>>> @@ -176,6 +176,10 @@ void x86_cpuid_policy_fill_native(struct
>>>ARRAY_SIZE(p->extd.raw) - 1); ++i )
>>>  cpuid_leaf(0x8000 + i, &p->extd.raw[i]);
>>>  
>>> +/* Don't report leaves from possible lower level hypervisor. */
>> ", for now."
>>
>> This will change in the future.
> I was pondering at that moment whether to add it, but then I didn't
> think we'd want to let shine through lower level hypervisor info.
> But yes, I've added it, because it won't be wrong to say "for now",
> even if it ends up being for much longer.

It won't be very much longer.

I don't intend to let it "shine through", but the outer hypervisor's data
should be in the raw/host policy, just as the Xen/Viridian leaves need to
be in the guest policies.

This is the longterm plan to handle Viridian features, because the
existing HVMPARAM simply isn't expressive enough.

~Andrew



Re: Event delivery and "domain blocking" on PVHv2

2020-06-22 Thread Roger Pau Monné
On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:
> On 2020-06-19 19:42, Roger Pau Monné wrote:
> > On Fri, Jun 19, 2020 at 06:54:26PM +0200, Roger Pau Monné wrote:
> > > On Fri, Jun 19, 2020 at 06:41:21PM +0200, Martin Lucina wrote:
> > > > On 2020-06-19 13:21, Roger Pau Monné wrote:
> > > > > On Fri, Jun 19, 2020 at 12:28:50PM +0200, Martin Lucina wrote:
> > > > > > On 2020-06-18 13:46, Roger Pau Monné wrote:
> > > > > > > On Thu, Jun 18, 2020 at 12:13:30PM +0200, Martin Lucina wrote:
> > > > > > > > At this point I don't really have a clear idea of how to 
> > > > > > > > progress,
> > > > > > > > comparing my implementation side-by-side with the original PV
> > > > > > > > Mini-OS-based
> > > > > > > > implementation doesn't show up any differences I can see.
> > > > > > > >
> > > > > > > > AFAICT the OCaml code I've also not changed in any material 
> > > > > > > > way, and
> > > > > > > > that
> > > > > > > > has been running in production on PV for years, so I'd be 
> > > > > > > > inclined
> > > > > > > > to think
> > > > > > > > the problem is in my reimplementation of the C parts, but 
> > > > > > > > where...?
> > > > > > >
> > > > > > > A good start would be to print the ISR and IRR lapic registers 
> > > > > > > when
> > > > > > > blocked, to assert there are no pending vectors there.
> > > > > > >
> > > > > > > Can you apply the following patch to your Xen, rebuild and check 
> > > > > > > the
> > > > > > > output of the 'l' debug key?
> > > > > > >
> > > > > > > Also add the output of the 'v' key.
> > > > > >
> > > > > > Had to fight the Xen Debian packages a bit as I wanted to patch the
> > > > > > exact
> > > > > > same Xen (there are some failures when building on a system that has
> > > > > > Xen
> > > > > > installed due to following symlinks when fixing shebangs).
> > > > > >
> > > > > > Here you go, when stuck during netfront setup, after allocating its
> > > > > > event
> > > > > > channel, presumably waiting on Xenstore:
> > > > > >
> > > > > > 'e':
> > > > > >
> > > > > > (XEN) Event channel information for domain 3:
> > > > > > (XEN) Polling vCPUs: {}
> > > > > > (XEN) port [p/m/s]
> > > > > > (XEN)1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> > > > > > (XEN)2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> > > > > > (XEN)3 [1/0/1]: s=5 n=0 x=0 v=0
> > > > > > (XEN)4 [0/1/1]: s=2 n=0 x=0 d=0
> > > > > >
> > > > > > 'l':
> > > > > >
> > > > > > (XEN) d3v0 IRR:
> > > > > > 8301732dc200b
> > > > > > (XEN) d3v0 ISR:
> > > > > > 8301732dc100b
> > > > >
> > > > > Which version of Xen is this? AFAICT it doesn't have the support to
> > > > > print a bitmap.
> > > >
> > > > That in Debian 10 (stable):
> > > >
> > > > ii  xen-hypervisor-4.11-amd64
> > > > 4.11.3+24-g14b62ab3e5-1~deb10u1.2
> > > > amd64Xen Hypervisor on AMD64
> > > >
> > > > xen_major  : 4
> > > > xen_minor  : 11
> > > > xen_extra  : .4-pre
> > > > xen_version: 4.11.4-pre
> > > >
> > > > >
> > > > > Do you think you could also pick commit
> > > > > 8cd9500958d818e3deabdd0d4164ea6fe1623d7c [0] and rebuild? (and print
> > > > > the info again).
> > > >
> > > > Done, here you go:
> > > >
> > > > (XEN) Event channel information for domain 3:
> > > > (XEN) Polling vCPUs: {}
> > > > (XEN) port [p/m/s]
> > > > (XEN)1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> > > > (XEN)2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> > > > (XEN)3 [1/0/1]: s=5 n=0 x=0 v=0
> > > > (XEN)4 [0/1/1]: s=3 n=0 x=0 d=0 p=35
> > > >
> > > >
> > > > (XEN) d3v0 IRR:
> > > > ,,,,,,,
> > > > (XEN) d3v0 ISR:
> > > > ,,,,,,,
> > > 
> > > So there's nothing pending on the lapic. Can you assert that you will
> > > always execute evtchn_demux_pending after you have received an event
> > > channel interrupt (ie: executed solo5__xen_evtchn_vector_handler)?
> > > 
> > > I think this would be simpler if you moved evtchn_demux_pending into
> > > solo5__xen_evtchn_vector_handler? As there would be less asynchronous
> > > processing, and thus likely less races?
> > 
> > Having though about this, I think this model of not demuxing in
> > solo5__xen_evtchn_vector_handler is always racy, as it's not possible
> > to assert that you would always call evtchn_demux_pending after
> > solo5__xen_evtchn_vector_handler?
> > 
> > Ie: if you receive an interrupt just before going to sleep (after the
> > sti and before the hlt) you will execute
> > solo5__xen_evtchn_vector_handler and EOI the vector, but then
> > evtchn_demux_pending will never get called, and thus the interrupts
> > will stay indefinitely pending?
> 
> Aha! Thank you for pointing this out. I think you may be right, but this
> should be possible without doing the demuxing in interrupt context.

If you don't do the demuxing in the interrupt context (ie: making the
interrupt handler a noop),

Re: UEFI support in ARM DomUs

2020-06-22 Thread Oleksandr Andrushchenko

On 6/19/20 11:02 PM, Stefano Stabellini wrote:
> On Thu, 18 Jun 2020, Julien Grall wrote:
>>> Copy/pasting here from my comment on the pull request plus additional
>>> thoughts.
>>>
>>> Uboot is one of those embedded projects that typically assumes that all
>>> the information about the platform is available at *build time*. It is
>>> meant to be built tailored for a specific version of a specific board. A
>>> Uboot binary is not expected to be "portable" across different versions
>>> of the platform or different platforms. In other words, it is not
>>> expected to be portable across Xen versions.
>> Can you define "version" here? Do you refer to the difference in terms
>> of memory?
>   
> Yes, I meant any meaningful differences in any of the external
> interfaces that end up impacting the UBoot implementation. A primary
> example would be the memory addresses for instance.
>
>
>>> This is a different model meant for different use-cases. I don't think
>>> it is a problem "philosophically" to let Uboot know about Xen details at
>>> build time. Yes, that is not what we did historically but it is very
>>> much in the spirit of Uboot.
>> TBH, I don't particularly mind that U-boot is built against a specific
>> version of Xen. I am more concerned about the long term implication if
>> we endorse it
>> and then try to change the memory layout in depth.
> I'll provide more information below.
>
>
>>> But of course the least Uboot depends on these details the better
>>> because it will be more flexible, but it could very well be that we
>>> won't be able to reach the point where the binary works on any version
>>> like we did with Tianocore. The two projects work differently.
>> Can we have a list of things U-boot require to know at compile time?
>>
>> In particular, I would like to understand if it would be sufficient to
>> only be aware of the first bank. If it is, then my preference would be
>> to standardize that bit of the layout.
> That would be my preference too, and it was great to read that it looks
> like it can be done. Yay!
>
>
>>> That said, I think Julien is right that we need to be careful on how we
>>> expose these information to Uboot, i.e. defining __XEN__ to build Uboot
>>> doesn't seem like a good idea because we enable definitions that were
>>> never meant to be used outside of a Xen build. Also, it wouldn't be easy
>>> to know exactly which definitions are actively used by Uboot and which
>>> ones aren't.
>>>
>>> If we are going to make Uboot dependent on version-specific information
>>> of the Xen interface, it would be better to be very clear about which
>>> definitions we are using. So that one day if we need to change them, we
>>> can find them easily.
>>>
>>> So I think it would be better to introduce a set of defines with the
>>> minimum amount of information required by Uboot statically. That way,
>>> at least we have a single place where to find all the version-specific
>>> definitions that Uboot is using.
>> I am not sure what you are suggesting. Can you expand a bit more?
>>
>>> We can also manage versioning of the
>>> Xen interface (like we do in QEMU) if we have to.
>> Can you provide more details about the compatibility? I am quite
>> interested in the part where you would have to deal with an older QEMU
>> version built against a new Xen?
> Sure let me expand a bit more. I'll switch "hat" to "QEMU (or UBoot)
> contributor" for the sake of this discussion.
>
> Xen Project offers certain stability guarantees on some interfaces and
> not others. Let's say that for any reason you have to or want to use one
> of the unstable interfaces in QEMU/UBoot/Whatever. Then it becomes your
> responsibility as QEMU developer to maintain compatibility in QEMU
> itself.
>
> First I'd add code to detect the version of the interface. The detection
> is based on the Xen headers/libraries provided. In QEMU it is done in
> the "configure" script. It sets a preprocessor #define to the version
> of the interface (e.g. CONFIG_XEN_CTRL_INTERFACE_VERSION == 41100).

I looked at QEMU's configure script and I'm afraid we can't have the
same flexibility in U-boot: we don't have a configure script, pkg-config
is not widely used, etc. So, for now we have hardcoded:

ifeq ($(CONFIG_XEN),y)
arch-y += -D__XEN_INTERFACE_VERSION__=0x00040d00
endif

and we also have Xen 4.13 headers in the U-boot tree.

For the first part (__XEN_INTERFACE_VERSION__) I think we can provide it via
CFLAGS or something. The same can be done for the location of the Xen headers.
We don't have a strong opinion here, so we would love to hear from the Xen
community on this.
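One way the hard-coded value could be made overridable is sketched below. This is only an illustration: CONFIG_XEN_INTERFACE_VERSION and XEN_HEADERS are made-up variable names, and U-boot's actual Kconfig/Makefile plumbing may want this done differently.

```make
# Let the defconfig or the user override these instead of hard-coding them.
CONFIG_XEN_INTERFACE_VERSION ?= 0x00040d00
XEN_HEADERS ?= $(srctree)/include/xen

ifeq ($(CONFIG_XEN),y)
arch-y += -D__XEN_INTERFACE_VERSION__=$(CONFIG_XEN_INTERFACE_VERSION) \
          -I$(XEN_HEADERS)
endif
```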

The current WIP with the fixes requested by Julien is at [1]: it can and will
be force-pushed as we are still working on it, but it can give you an
impression of what we have as of now.

>
> Then you can have preprocessors checks in one of the headers such as:
>
> #if CONFIG_XEN_CTRL_INTERFACE_VERSION < 40701
>  #define xenevtchn_open(l, f) xc_evtchn_open(l, f);
> #else
>  XXX
>

Re: UEFI support in ARM DomUs

2020-06-22 Thread Julien Grall




On 22/06/2020 14:56, Oleksandr Andrushchenko wrote:


On 6/19/20 4:29 PM, Oleksandr Andrushchenko wrote:

On 6/19/20 4:15 PM, Julien Grall wrote:



On 19/06/2020 14:06, Oleksandr Andrushchenko wrote:


On 6/19/20 3:59 PM, Julien Grall wrote:

Hi,

On 19/06/2020 13:51, Oleksandr Andrushchenko wrote:

On 6/19/20 3:47 PM, Julien Grall wrote:

They will not be available from the fdt, but you can retrieve them with a 
hypervisor call (see HVM_PARAM_STORE_PFN, HVM_PARAM_CONSOLE_PFN).

Yes, and it is used in the relevant pieces of code (hyp calls)

One question though, why do you need to map them in advance? Couldn't you map 
them on demand?


Well, we need to at least estimate the pg_table size so we can reserve and 
allocate memory later,


Oh, so U-boot doesn't support runtime page-table allocation. Is that 
right?

As per my understanding no, we provide a memory map and the tables are 
allocated beforehand


Ok :(.





so I have to provide the memory range either by coding a constant or looking 
into the devtree at

hypervisor { reg = <>; }. It is a bit tricky though


Looking for a node in the device-tree shouldn't be too difficult given that you 
have fdt_* available.

However, please note that  doesn't refer to the guest magic pages. Instead, 
it provides a region you can use for mapping the grant-table frames


Indeed, this is in my case 0x3800, but the magic is at 0x3900

So, I need the memory range set up beforehand, but I can't as there is no cute 
way to get that.

Of course, I can issue a hyp call to get HVM_PARAM_CONSOLE_PFN and use it as 
the base address,

but this smells like a hack. I can call other HVM_PARAM_ to get their pfns and 
set up the memory regions,

but this looks a bit weird.


Why is it weird? In general, you only want to map exactly what you need. Not 
less, not more.

In your situation, you only care about two pages, the console page and the 
xenstore page. The rest shouldn't be mapped.

Ok, so I'll try get pfns for HVM_PARAM_CONSOLE_PFN + XENSTORE_PFN_OFFSET via 
hyp call and map those



I need that constant badly ;)


No you don't ;).


We have managed to make use of the relevant hypercalls to get the PFNs, but
for that we need to maintain the caches, as these calls happen while the MMU
is not yet set up and is still being configured.


Glad to hear it works. Yes, that's unfortunately the drawback of using 
hypercalls with MMU off. But at least you make U-boot more generic :).


Cheers,

--
Julien Grall



Re: UEFI support in ARM DomUs

2020-06-22 Thread Julien Grall

Hi Oleksandr,

On 22/06/2020 15:04, Oleksandr Andrushchenko wrote:

On 6/19/20 11:02 PM, Stefano Stabellini wrote:

On Thu, 18 Jun 2020, Julien Grall wrote:

ifeq ($(CONFIG_XEN),y)
arch-y += -D__XEN_INTERFACE_VERSION__=0x00040d00
endif

and we also have Xen 4.13 headers in the U-boot tree.


Sorry if this was already asked before. Why do you need to specify 
__XEN_INTERFACE_VERSION__?




For the first part (__XEN_INTERFACE_VERSION__) I think we can provide it via

CFLAGS or something. This can also be done for the location of Xen headers.


__XEN_INTERFACE_VERSION__ should work through the CFLAGS. An alternative 
would be to allow the user to specify it through Kconfig.

Cheers,

--
Julien Grall



Re: UEFI support in ARM DomUs

2020-06-22 Thread Oleksandr Andrushchenko

On 6/22/20 5:27 PM, Julien Grall wrote:
> Hi Oleksandr,
>
> On 22/06/2020 15:04, Oleksandr Andrushchenko wrote:
>> On 6/19/20 11:02 PM, Stefano Stabellini wrote:
>>> On Thu, 18 Jun 2020, Julien Grall wrote:
>> ifeq ($(CONFIG_XEN),y)
>> arch-y += -D__XEN_INTERFACE_VERSION__=0x00040d00
>> endif
>>
>> and we also have Xen 4.13 headers in the U-boot tree.
>
> Sorry if this was already asked before. Why do you need to specify 
> __XEN_INTERFACE_VERSION__?

This is because of include/xen/interface/xen-compat.h:

#if defined(__XEN__) || defined(__XEN_TOOLS__)

/* Xen is built with matching headers and implements the latest interface. */
#define __XEN_INTERFACE_VERSION__ __XEN_LATEST_INTERFACE_VERSION__
#elif !defined(__XEN_INTERFACE_VERSION__)
/* Guests which do not specify a version get the legacy interface. */
#define __XEN_INTERFACE_VERSION__ 0x
#endif

So, one needs to specify the version (QEMU does that via its configure
script after some tests).

>
>>
>> For the first part (__XEN_INTERFACE_VERSION__) I think we can provide it via
>>
>> CFLAGS or something. This can also be done for the location of Xen headers.
>
> __XEN_INTERFACE_VERSION__ should work through the CFLAGS. An alternative 
> would be to allow the user to specify it through Kconfig.

You mean specifying via Kconfig something like "0x00040d00"?

And what about the headers? How will we provide their location if we decide
not to include those in the tree?

> Cheers,
>

Re: [PATCH v2 4/7] x86/vmx: add do_vmtrace_op

2020-06-22 Thread Michał Leszczyński
- 22 cze 2020 o 15:25, Jan Beulich jbeul...@suse.com napisał(a):

> On 19.06.2020 01:41, Michał Leszczyński wrote:
>> +
>> +domain_pause(d);
> 
> Who's the intended caller of this interface? You making it a hvm-op
> suggests the guest may itself call this. But of course a guest
> can't pause itself. If this is supposed to be a tools-only interface,
> then you should frame it suitably in the public header and of course
> you need to enforce this here (which would e.g. mean you shouldn't
> use rcu_lock_domain_by_any_id()).
> 

What should I use instead of rcu_lock_domain_by_any_id()?

> Also please take a look at hvm/ioreq.c, which makes quite a bit of
> use of domain_pause(). In particular I think you want to acquire
> the lock only after having paused the domain.
> 

This domain_pause() will be changed to vcpu_pause().

> Shouldn't you rather remove the MSR from the load list here?
> 

This will be fixed.

> Is any of what you do in this switch() actually legitimate without
> hvm_set_vmtrace_pt_size() having got called for the guest? From
> remarks elsewhere I imply you expect the param that you currently
> use to be set upon domain creation time, but at the very least the
> potentially big buffer should imo not get allocated up front, but
> only when tracing is to actually be enabled.

Wait... so you want to allocate these buffers at runtime?
Previously we were talking that there is too much runtime logic
and these enable/disable hypercalls should be stripped to absolute
minimum.


>> --- a/xen/include/public/hvm/hvm_op.h
>> +++ b/xen/include/public/hvm/hvm_op.h
>> @@ -382,6 +382,29 @@ struct xen_hvm_altp2m_op {
>>  typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
>>  DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
>>  
>> +/* HVMOP_vmtrace: Perform VM tracing related operation */
>> +#define HVMOP_vmtrace 26
>> +
>> +#define HVMOP_VMTRACE_INTERFACE_VERSION 0x0001
> 
> I'm unconvinced we want to introduce yet another versioned interface.
> In any event, as hinted at earlier, this suggests it wants to be a
> tools-only interface instead (which, at least for the time being, is
> not required to be a stable interface then, but that's also something
> we apparently want to move away from, and hence you may better not
> try to rely on it not needing to be stable).

Ok. I will remove the interface version.

> 
>> +struct xen_hvm_vmtrace_op {
>> +/* IN variable */
>> +uint32_t version;   /* HVMOP_VMTRACE_INTERFACE_VERSION */
>> +uint32_t cmd;
>> +/* Enable/disable external vmtrace for given domain */
>> +#define HVMOP_vmtrace_ipt_enable  1
>> +#define HVMOP_vmtrace_ipt_disable 2
>> +#define HVMOP_vmtrace_ipt_get_offset  3
>> +domid_t domain;
>> +uint32_t vcpu;
>> +uint64_t size;
>> +
>> +/* OUT variable */
>> +uint64_t offset;
> 
> If this is to be a tools-only interface, please use uint64_aligned_t.
> 

This type is not defined within hvm_op.h header. What should I do about it?

> You also want to add an entry to xen/include/xlat.lst and use the
> resulting macro to prove that the struct layout is the same for
> native and compat callers.

Could you tell a little bit more about this? What are "native" and
"compat" callers and what is the purpose of this file?


Best regards,
Michał Leszczyński
CERT Polska



Re: [PATCH for-4.14] x86/tlb: fix assisted flush usage

2020-06-22 Thread Roger Pau Monné
On Mon, Jun 22, 2020 at 03:51:24PM +0200, Jan Beulich wrote:
> On 22.06.2020 15:24, Roger Pau Monné wrote:
> > On Mon, Jun 22, 2020 at 01:07:10PM +0200, Jan Beulich wrote:
> >> On 22.06.2020 11:31, Roger Pau Monné wrote:
> >>> On Fri, Jun 19, 2020 at 04:06:55PM +0200, Jan Beulich wrote:
>  On 18.06.2020 18:04, Roger Pau Monne wrote:
> > Commit e9aca9470ed86 introduced a regression when avoiding sending
> > IPIs for certain flush operations. Xen page fault handler
> > (spurious_page_fault) relies on blocking interrupts in order to
> > prevent handling TLB flush IPIs and thus preventing other CPUs from
> > removing page tables pages. Switching to assisted flushing avoided such
> > IPIs, and thus can result in pages belonging to the page tables being
> > removed (and possibly re-used) while __page_fault_type is being
> > executed.
> >
> > Force some of the TLB flushes to use IPIs, thus avoiding the assisted
> > TLB flush. Those selected flushes are the page type change (when
> > switching from a page table type to a different one, ie: a page that
> > has been removed as a page table) and page allocation. This sadly has
> > a negative performance impact on the pvshim, as less assisted flushes
> > can be used.
> >
> > Introduce a new flag (FLUSH_FORCE_IPI) and helper to force a TLB flush
> > using an IPI (flush_tlb_mask_sync). Note that the flag is only
> > meaningfully defined when the hypervisor supports PV mode, as
> > otherwise translated domains are in charge of their page tables and
> > won't share page tables with Xen, thus not influencing the result of
> > page walks performed by the spurious fault handler.
> 
>  Is this true for shadow mode? If a page shadowing a guest one was
>  given back quickly enough to the allocator and then re-used, I think
>  the same situation could in principle arise.
> >>>
> >>> Hm, I think it's not applicable to HVM shadow mode at least, because
> >>> CR3 is switched as part of vmentry/vmexit, and the page tables are not
> >>> shared between Xen and the guest, so there's no way for a HVM shadow
> >>> guest to modify the page-tables while Xen is walking them in
> >>> spurious_page_fault (note spurious_page_fault is only called when the
> >>> fault happens in non-guest context).
> >>
> >> I'm afraid I disagree, because of shadow's use of "linear page tables".
> > 
> > You will have to bear with me, but I don't follow.
> > 
> > Could you provide some pointers at how/where the shadow (I assume
> > guest controlled) "linear page tables" are used while in Xen
> > context?
> 
> See config.h:
> 
> /* Slot 258: linear page table (guest table). */
> #define LINEAR_PT_VIRT_START(PML4_ADDR(258))
> #define LINEAR_PT_VIRT_END  (LINEAR_PT_VIRT_START + PML4_ENTRY_BYTES)
> /* Slot 259: linear page table (shadow table). */
> #define SH_LINEAR_PT_VIRT_START (PML4_ADDR(259))
> #define SH_LINEAR_PT_VIRT_END   (SH_LINEAR_PT_VIRT_START + PML4_ENTRY_BYTES)
> 
> These linear page tables exist in the Xen page tables at basically
> all times as long as a shadow guest's vCPU is in context. They're
> there to limit the overhead of accessing guest page tables and
> their shadows from inside Xen.

Oh, I have to admit I know very little about all this, and I'm not
able to find a description of how this is to be used.

I think the shadow linear page tables should be per-pCPU, and hence
they cannot be modified by the guest while a spurious page fault is
being processed? (since the vCPU running on the pCPU is in Xen
context).

Thanks, Roger.



Xen 4.14 RC3

2020-06-22 Thread Paul Durrant
Hi all,

Xen 4.14 RC3 is tagged. You can check that out from xen.git:

git://xenbits.xen.org/xen.git 4.14.0-rc3

For your convenience there is also a tarball at:
https://downloads.xenproject.org/release/xen/4.14.0-rc3/xen-4.14.0-rc3.tar.gz

And the signature is at:
https://downloads.xenproject.org/release/xen/4.14.0-rc3/xen-4.14.0-rc3.tar.gz.sig

Please send bug reports and test reports to xen-devel@lists.xenproject.org.
When sending bug reports, please CC relevant maintainers and me (p...@xen.org).

As a reminder, there will be a Xen Test Day. Please see the test day schedule at
https://wiki.xenproject.org/wiki/Xen_Project_Test_Days and test instructions at
https://wiki.xenproject.org/wiki/Xen_4.14_RC_test_instructions.

  Paul Durrant




Re: [PATCH v2 4/7] x86/vmx: add do_vmtrace_op

2020-06-22 Thread Jan Beulich
On 22.06.2020 16:35, Michał Leszczyński wrote:
> - 22 cze 2020 o 15:25, Jan Beulich jbeul...@suse.com napisał(a):
>> On 19.06.2020 01:41, Michał Leszczyński wrote:
>>> +
>>> +domain_pause(d);
>>
>> Who's the intended caller of this interface? You making it a hvm-op
>> suggests the guest may itself call this. But of course a guest
>> can't pause itself. If this is supposed to be a tools-only interface,
>> then you should frame it suitably in the public header and of course
>> you need to enforce this here (which would e.g. mean you shouldn't
>> use rcu_lock_domain_by_any_id()).
>>
> 
> What should I use instead of rcu_lock_domain_by_any_id()?

Please take a look at the header where its declaration lives. It's
admittedly not the usual thing in Xen, but there are even comments
describing the differences between the four related by-id functions.
I guess rcu_lock_live_remote_domain_by_id() is the one you want to
use, despite being puzzled by there being surprisingly little uses
elsewhere.

>> Also please take a look at hvm/ioreq.c, which makes quite a bit of
>> use of domain_pause(). In particular I think you want to acquire
>> the lock only after having paused the domain.
> 
> This domain_pause() will be changed to vcpu_pause().

And you understand that my comment then still applies?

>> Shouldn't you rather remove the MSR from the load list here?
> 
> This will be fixed.

Thanks for trimming your reply, but I think you've gone too far:
Context should still be such that one can see what the comments
are about without having to go back to the original mail. Please
try to find some middle ground.

>> Is any of what you do in this switch() actually legitimate without
>> hvm_set_vmtrace_pt_size() having got called for the guest? From
>> remarks elsewhere I imply you expect the param that you currently
>> use to be set upon domain creation time, but at the very least the
>> potentially big buffer should imo not get allocated up front, but
>> only when tracing is to actually be enabled.
> 
> Wait... so you want to allocate these buffers in runtime?
> Previously we were talking that there is too much runtime logic
> and these enable/disable hypercalls should be stripped to absolute
> minimum.

Basic arrangements can be made at domain creation time. I don't
think though that it would be a good use of memory if you
allocated perhaps many gigabytes of memory just for possibly
wanting to enable tracing on a guest. 

>>> +struct xen_hvm_vmtrace_op {
>>> +/* IN variable */
>>> +uint32_t version;   /* HVMOP_VMTRACE_INTERFACE_VERSION */
>>> +uint32_t cmd;
>>> +/* Enable/disable external vmtrace for given domain */
>>> +#define HVMOP_vmtrace_ipt_enable  1
>>> +#define HVMOP_vmtrace_ipt_disable 2
>>> +#define HVMOP_vmtrace_ipt_get_offset  3
>>> +domid_t domain;
>>> +uint32_t vcpu;
>>> +uint64_t size;
>>> +
>>> +/* OUT variable */
>>> +uint64_t offset;
>>
>> If this is to be a tools-only interface, please use uint64_aligned_t.
>>
> 
> This type is not defined within hvm_op.h header. What should I do about it?

It gets defined by xen.h, so should be available here. Its
definitions live in a

#if defined(__XEN__) || defined(__XEN_TOOLS__)

section, which is what I did recommend to put your interface in
as well. Unless you want this to be exposed to the guest itself,
at which point further constraints would arise.

>> You also want to add an entry to xen/include/xlat.lst and use the
>> resulting macro to prove that the struct layout is the same for
>> native and compat callers.
> 
> Could you tell a little bit more about this? What are "native" and
> "compat" callers and what is the purpose of this file?

A native caller is one whose bitness matches Xen's, i.e. for x86
a guest running in 64-bit mode. A compat guest is one running in
32-bit mode. Interface structure layout, at least for historical
reasons, can differ between the two. If it does, these
structures need translation. In the case here the layouts look
to match, which we still want to be verified at build time. If
you add a suitable line to xlat.lst starting with a ?, a macro
will be generated that you can then invoke somewhere near the
code that actually handles this sub-hypercall. See e.g. the top
of xen/common/hypfs.c for a relatively recent addition of such.

Jan



Re: Event delivery and "domain blocking" on PVHv2

2020-06-22 Thread Martin Lucina

On 2020-06-22 15:58, Roger Pau Monné wrote:

On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:
Aha! Thank you for pointing this out. I think you may be right, but this
should be possible without doing the demuxing in interrupt context.


If you don't do the demuxing in the interrupt context (ie: making the
interrupt handler a noop), then you don't likely need such an interrupt
anyway?


I need the/an interrupt to wake the VCPU from HLT state if we went to 
sleep.




How about this arrangement, which appears to work for me; no hangs I can see
so far and domU survives ping -f fine with no packet loss:

CAMLprim value
mirage_xen_evtchn_block_domain(value v_deadline)
{
struct vcpu_info *vi = VCPU0_INFO();
solo5_time_t deadline = Int64_val(v_deadline);

if (solo5_clock_monotonic() < deadline) {
__asm__ __volatile__ ("cli" : : : "memory");
if (vi->evtchn_upcall_pending) {
__asm__ __volatile__ ("sti");
}
else {
hypercall_set_timer_op(deadline);


What if you set a deadline so close that evtchn_upcall_pending gets
set by Xen here and the interrupt is injected? You would execute the
noop handler and just hlt, and could likely end up in the same blocked
situation as before?


Why would an interrupt be injected here? Doesn't the immediately preceding
"cli" disable that?

Or perhaps I need to do a PV/HVM hybrid and set vi->evtchn_upcall_mask just
before the cli, and clear it after the sti?


i.e. Always go to sleep with interrupts disabled, but before doing so
re-check that no events have become pending since the last time
evtchn_demux_pending() was called. This holds, since the only thing that
sets vi->evtchn_upcall_pending is Xen, and the only thing that clears it is
evtchn_demux_pending().

Right?


TBH this is a hard model to get right, I think your best bet at
attempting something along this lines is to forget about using the
event channel interrupt and instead use SCHEDOP_poll. You could do
something like (written in pure C as I have no idea of the ocaml
bindings):
[SCHEDOP_poll code snipped]


Thanks for the suggestion. This brings us full-circle -- I found [1] and
[2] way back from 2013 when Mirage/Xen was initially using SCHEDOP_poll
and then switched to the current interrupts + SCHEDOP_block approach.

Part of the motivation for the change at the time was to allow waiting
on/servicing more than 128 ports (the limit for SCHEDOP_poll). I doubt
anyone wants to do that these days, but it still makes me a bit reluctant
to change back to SCHEDOP_poll.

In an attempt to understand why the original PV code worked I re-read the PV
Mini-OS block_domain code again and realised that I had entirely missed one
part of its behaviour, which is that it intends[*] to run with
interrupts/upcalls disabled *all* the time and relies on SCHEDOP_block
atomically re-enabling them and triggering an upcall before returning (PV)
or "briefly enabling interrupts to allow handlers to run" (HVM). We're doing
the inverse, but our behaviour matches my mental model of how things should
work.


Not really IMO, as SCHEDOP_block is a single 'instruction' from a
guest PoV that does the enabling of interrupts and returns if there
are pending ones.


Indeed and this is exactly the operation I need in the HVM world with the
current model.


Also SCHEDOP_block is not exactly the same on HVM, as it just checks
for pending vectors to inject, but not for pending event channels.


... well, I don't want to use SCHEDOP_block anyway since that is not possible
on ARM, and I would like to keep the code at least "portable with not too
much effort". So there should be a non-racy "HVM way" to do this?


HVM you cannot call hlt with interrupts disabled, or the vCPU will be
taken down.

There are quite a lot of subtle differences between PV and HVM in this
regard, and I think the best approach would be to use SCHEDOP_poll in
order to implement the kind of model you describe.


I can see that now. The interactions between the "virtual hardware"
(interrupt delivery, VCPU IF) and "PV" parts (event delivery, masking) are
hard to understand for me, yet the two are intertwined in the way HVM
works :-(

Coming back to your earlier suggestion of moving the event demuxing (but not
the handling) into the upcall interrupt handler itself, perhaps that approach
is still worth investigating in combination with re-working the OCaml-side
array view of pending events, or at least ensuring that atomic operations are
used since it would now be updated asynchronously.

Martin

[1] 
https://lists.cam.ac.uk/pipermail/cl-mirage/2013-September/msg00053.html

[2] https://github.com/mirage/mirage-platform/pull/58



Re: [PATCH for-4.14] x86/tlb: fix assisted flush usage

2020-06-22 Thread Jan Beulich
On 22.06.2020 16:56, Roger Pau Monné wrote:
> On Mon, Jun 22, 2020 at 03:51:24PM +0200, Jan Beulich wrote:
>> On 22.06.2020 15:24, Roger Pau Monné wrote:
>>> On Mon, Jun 22, 2020 at 01:07:10PM +0200, Jan Beulich wrote:
 On 22.06.2020 11:31, Roger Pau Monné wrote:
> On Fri, Jun 19, 2020 at 04:06:55PM +0200, Jan Beulich wrote:
>> On 18.06.2020 18:04, Roger Pau Monne wrote:
>>> Commit e9aca9470ed86 introduced a regression when avoiding sending
>>> IPIs for certain flush operations. Xen page fault handler
>>> (spurious_page_fault) relies on blocking interrupts in order to
>>> prevent handling TLB flush IPIs and thus preventing other CPUs from
>>> removing page tables pages. Switching to assisted flushing avoided such
>>> IPIs, and thus can result in pages belonging to the page tables being
>>> removed (and possibly re-used) while __page_fault_type is being
>>> executed.
>>>
>>> Force some of the TLB flushes to use IPIs, thus avoiding the assisted
>>> TLB flush. Those selected flushes are the page type change (when
>>> switching from a page table type to a different one, ie: a page that
>>> has been removed as a page table) and page allocation. This sadly has
>>> a negative performance impact on the pvshim, as less assisted flushes
>>> can be used.
>>>
>>> Introduce a new flag (FLUSH_FORCE_IPI) and helper to force a TLB flush
>>> using an IPI (flush_tlb_mask_sync). Note that the flag is only
>>> meaningfully defined when the hypervisor supports PV mode, as
>>> otherwise translated domains are in charge of their page tables and
>>> won't share page tables with Xen, thus not influencing the result of
>>> page walks performed by the spurious fault handler.
>>
>> Is this true for shadow mode? If a page shadowing a guest one was
>> given back quickly enough to the allocator and then re-used, I think
>> the same situation could in principle arise.
>
> Hm, I think it's not applicable to HVM shadow mode at least, because
> CR3 is switched as part of vmentry/vmexit, and the page tables are not
> shared between Xen and the guest, so there's no way for a HVM shadow
> guest to modify the page-tables while Xen is walking them in
> spurious_page_fault (note spurious_page_fault is only called when the
> fault happens in non-guest context).

 I'm afraid I disagree, because of shadow's use of "linear page tables".
>>>
>>> You will have to bear with me, but I don't follow.
>>>
>>> Could you provide some pointers at how/where the shadow (I assume
>>> guest controlled) "linear page tables" are used while in Xen
>>> context?
>>
>> See config.h:
>>
>> /* Slot 258: linear page table (guest table). */
>> #define LINEAR_PT_VIRT_START(PML4_ADDR(258))
>> #define LINEAR_PT_VIRT_END  (LINEAR_PT_VIRT_START + PML4_ENTRY_BYTES)
>> /* Slot 259: linear page table (shadow table). */
>> #define SH_LINEAR_PT_VIRT_START (PML4_ADDR(259))
>> #define SH_LINEAR_PT_VIRT_END   (SH_LINEAR_PT_VIRT_START + PML4_ENTRY_BYTES)
>>
>> These linear page tables exist in the Xen page tables at basically
>> all times as long as a shadow guest's vCPU is in context. They're
>> there to limit the overhead of accessing guest page tables and
>> their shadows from inside Xen.
> 
> Oh, I have to admit I know very little about all this, and I'm not
> able to find a description of how this is to be used.
> 
> I think the shadow linear page tables should be per-pCPU, and hence
> they cannot be modified by the guest while a spurious page fault is
> being processed? (since the vCPU running on the pCPU is in Xen
> context).

A guest would have for some linear address e.g.

vCR3 -> G4 -> G3 -> G2 -> G1

visible to some random set of its CPUs (i.e. the same CR3 can be in
use by multiple vCPU-s). This will then have shadows like this:

pCR3 -> S4 -> S3 -> S2 -> S1

The G4 page gets hooked up into LINEAR_PT (i.e. becomes an L3 page)
for all vCPU-s that have this very CR3 active. Same goes for S4 and
SH_LINEAR_PT respectively. Afaik shadows of guest page tables also
don't get created on a per-pCPU basis - a page table either has a
shadow, or it doesn't.

Hence afaict page tables mapped through these facilities (and
reachable while in Xen) can very well be modified "behind our backs".

Jan



Re: Event delivery and "domain blocking" on PVHv2

2020-06-22 Thread Jan Beulich
On 22.06.2020 17:31, Martin Lucina wrote:
> On 2020-06-22 15:58, Roger Pau Monné wrote:
>> On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:
>>> How about this arrangement, which appears to work for me; no hangs I 
>>> can see
>>> so far and domU survives ping -f fine with no packet loss:
>>>
>>> CAMLprim value
>>> mirage_xen_evtchn_block_domain(value v_deadline)
>>> {
>>> struct vcpu_info *vi = VCPU0_INFO();
>>> solo5_time_t deadline = Int64_val(v_deadline);
>>>
>>> if (solo5_clock_monotonic() < deadline) {
>>> __asm__ __volatile__ ("cli" : : : "memory");
>>> if (vi->evtchn_upcall_pending) {
>>> __asm__ __volatile__ ("sti");
>>> }
>>> else {
>>> hypercall_set_timer_op(deadline);
>>
>> What if you set a deadline so close that evtchn_upcall_pending gets
>> set by Xen here and the interrupt is injected? You would execute the
>> noop handler and just hlt, and could likely end up in the same blocked
>> situation as before?
> 
> Why would an interrupt be injected here? Doesn't the immediately 
> preceding
> "cli" disable that?
> 
> Or perhaps I need to do a PV/HVM hybrid and set vi->evtchn_upcall_mask 
> just
> before the cli, and clear it after the sti?

evtchn_upcall_mask is a strictly PV-only thing. See e.g. the code
comment in hvm_set_callback_irq_level().

Jan



Re: [PATCH v2 4/7] x86/vmx: add do_vmtrace_op

2020-06-22 Thread Michał Leszczyński
- 22 cze 2020 o 17:22, Jan Beulich jbeul...@suse.com napisał(a):

> On 22.06.2020 16:35, Michał Leszczyński wrote:
>> - 22 cze 2020 o 15:25, Jan Beulich jbeul...@suse.com napisał(a):
>>> On 19.06.2020 01:41, Michał Leszczyński wrote:
 +
 +domain_pause(d);
>>>
>>> Who's the intended caller of this interface? You making it a hvm-op
>>> suggests the guest may itself call this. But of course a guest
>>> can't pause itself. If this is supposed to be a tools-only interface,
>>> then you should frame it suitably in the public header and of course
>>> you need to enforce this here (which would e.g. mean you shouldn't
>>> use rcu_lock_domain_by_any_id()).
>>>
>> 
>> What should I use instead of rcu_lock_domain_by_and_id()?
> 
> Please take a look at the header where its declaration lives. It's
> admittedly not the usual thing in Xen, but there are even comments
> describing the differences between the four related by-id functions.
> I guess rcu_lock_live_remote_domain_by_id() is the one you want to
> use, despite being puzzled by there being surprisingly little uses
> elsewhere.
> 

Ok, I will correct this.

>>> Also please take a look at hvm/ioreq.c, which makes quite a bit of
>>> use of domain_pause(). In particular I think you want to acquire
>>> the lock only after having paused the domain.
>> 
>> This domain_pause() will be changed to vcpu_pause().
> 
> And you understand that my comment then still applies?

If you mean that we should first call vcpu_pause()
and then acquire spinlock, then yes, this will be corrected in v3.

>>> Is any of what you do in this switch() actually legitimate without
>>> hvm_set_vmtrace_pt_size() having got called for the guest? From
>>> remarks elsewhere I imply you expect the param that you currently
>>> use to be set upon domain creation time, but at the very least the
>>> potentially big buffer should imo not get allocated up front, but
>>> only when tracing is to actually be enabled.
>> 
>> Wait... so you want to allocate these buffers in runtime?
>> Previously we were talking that there is too much runtime logic
>> and these enable/disable hypercalls should be stripped to absolute
>> minimum.
> 
> Basic arrangements can be made at domain creation time. I don't
> think though that it would be a good use of memory if you
> allocated perhaps many gigabytes of memory just for possibly
> wanting to enable tracing on a guest.
> 

From our previous conversations I thought that you want to have
as much logic moved to the domain creation as possible.

Thus, a parameter "vmtrace_pt_size" was introduced. By default it's
zero (= disabled), if you set it to a non-zero value, then trace buffers
of given size will be allocated for the domain and you have possibility
to use ipt_enable/ipt_disable at any moment.

This way the runtime logic is as thin as possible. I assume user knows
in advance whether he/she would want to use external monitoring with IPT
or not.

It's also not "many gigabytes". In most use cases a buffer of 16/32/64 MB
would suffice, I think.

If we want to fall back to the scenario where the trace buffer is
allocated dynamically, then we basically get back to patch v1
implementation.

 +struct xen_hvm_vmtrace_op {
 +/* IN variable */
 +uint32_t version;   /* HVMOP_VMTRACE_INTERFACE_VERSION */
 +uint32_t cmd;
 +/* Enable/disable external vmtrace for given domain */
 +#define HVMOP_vmtrace_ipt_enable  1
 +#define HVMOP_vmtrace_ipt_disable 2
 +#define HVMOP_vmtrace_ipt_get_offset  3
 +domid_t domain;
 +uint32_t vcpu;
 +uint64_t size;
 +
 +/* OUT variable */
 +uint64_t offset;
>>>
>>> If this is to be a tools-only interface, please use uint64_aligned_t.
>>>
>> 
>> This type is not defined within hvm_op.h header. What should I do about it?
> 
> It gets defined by xen.h, so should be available here. Its
> definitions live in a
> 
> #if defined(__XEN__) || defined(__XEN_TOOLS__)
> 
> section, which is what I did recommend to put your interface in
> as well. Unless you want this to be exposed to the guest itself,
> at which point further constraints would arise.
> 
>>> You also want to add an entry to xen/include/xlat.lst and use the
>>> resulting macro to prove that the struct layout is the same for
>>> native and compat callers.
>> 
>> Could you tell a little bit more about this? What are "native" and
>> "compat" callers and what is the purpose of this file?
> 
> A native caller is one whose bitness matches Xen's, i.e. for x86
> a guest running in 64-bit mode. A compat guest is one running in
> 32-bit mode. Interface structure layout, at least for historical
> reasons, can differ between the two. If it does, these
> structures need translation. In the case here the layouts look
> to match, which we still want to be verified at build time. If
> you add a suitable line to xlat.lst starting with a ?, a macro
> will be generated that you can then invoke somewhere near the
> code that actually handles this sub-hypercall. See e.g. the top
> of xen/common/hypfs.c for a relatively recent addition of such.

Re: Event delivery and "domain blocking" on PVHv2

2020-06-22 Thread Roger Pau Monné
On Mon, Jun 22, 2020 at 05:31:00PM +0200, Martin Lucina wrote:
> On 2020-06-22 15:58, Roger Pau Monné wrote:
> > On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:
> > > Aha! Thank you for pointing this out. I think you may be right, but
> > > this
> > > should be possible without doing the demuxing in interrupt context.
> > 
> > If you don't do the demuxing in the interrupt context (ie: making the
> > interrupt handler a noop), then you don't likely need such interrupt
> > anyway?
> 
> I need the/an interrupt to wake the VCPU from HLT state if we went to sleep.
> 
> > 
> > > How about this arrangement, which appears to work for me; no hangs I
> > > can see
> > > so far and domU survives ping -f fine with no packet loss:
> > > 
> > > CAMLprim value
> > > mirage_xen_evtchn_block_domain(value v_deadline)
> > > {
> > > struct vcpu_info *vi = VCPU0_INFO();
> > > solo5_time_t deadline = Int64_val(v_deadline);
> > > 
> > > if (solo5_clock_monotonic() < deadline) {
> > > __asm__ __volatile__ ("cli" : : : "memory");
> > > if (vi->evtchn_upcall_pending) {
> > > __asm__ __volatile__ ("sti");
> > > }
> > > else {
> > > hypercall_set_timer_op(deadline);
> > 
> > What if you set a deadline so close that evtchn_upcall_pending gets
> > set by Xen here and the interrupt is injected? You would execute the
> > noop handler and just hlt, and could likely end up in the same blocked
> > situation as before?
> 
> Why would an interrupt be injected here? Doesn't the immediately preceding
> "cli" disable that?

Well, I mean between the sti and the hlt instruction. I think there's
always a window for a race here, and that's the reason for having
SCHEDOP_block (see comment referring to avoiding a "wakeup waiting"
race).

> Or perhaps I need to do a PV/HVM hybrid and set vi->evtchn_upcall_mask just
> before the cli, and clear it after the sti?

I think SCHEDOP_block is broken on HVM since, as Jan points out,
evtchn_upcall_mask is not used on HVM (we should really have a comment
about this in xen.h). So that hypercall is no longer useful to avoid
wakeup waiting races.

> > > i.e. Always go to sleep with interrupts disabled, but before doing so
> > > re-check that no events have become pending since the last time
> > > evtchn_demux_pending() was called. This holds, since the only thing
> > > that
> > > sets vi->evtchn_upcall_pending is Xen, and the only thing that
> > > clears it is
> > > evtchn_demux_pending().
> > > 
> > > Right?
> > 
> > TBH this is a hard model to get right, I think your best bet at
> > attempting something along this lines is to forget about using the
> > event channel interrupt and instead use SCHEDOP_poll. You could do
> > something like (written in pure C as I have no idea of the ocaml
> > bindings):
> > [SCHEDOP_poll code snipped]
> 
> Thanks for the suggestion. This brings us full-circle -- I found [1] and
> [2] way back from 2013 when Mirage/Xen was initially using SCHEDOP_poll
> and then switched to the current interrupts + SCHEDOP_block approach.
> 
> Part of the motivation for the change at the time was to allow waiting
> on/servicing more than 128 ports (the limit for SCHEDOP_poll). I doubt
> anyone wants to do that these days, but it still makes me a bit reluctant
> to change back to SCHEDOP_poll.

Right, Mirage/Xen being single processor it's not likely to use more
than 128 event channels, but I can understand the desire to not be
limited by that amount.

> > > In an attempt to understand why the original PV code worked I
> > > re-read the PV
> > > Mini-OS block_domain code again and realised that I had entirely
> > > missed one
> > > part of its behaviour, which is that it intends[*] to run with
> > > interrupts/upcalls disabled *all* the time and relies on SCHEDOP_block
> > > atomically re-enabling them and triggering an upcall before
> > > returning (PV)
> > > or "briefly enabling interrupts to allow handlers to run" (HVM).
> > > We're doing
> > > the inverse, but our behaviour matches my mental model of how things
> > > should
> > > work.
> > 
> > Not really IMO, as SCHEDOP_block is a single 'instruction' from a
> > guest PoV that does the enabling of interrupts and returns if there
> > are pending ones.
> 
> Indeed and this is exactly the operation I need in the HVM world with the
> current model.

Another option would be to consider re-purposing SCHEDOP_block for HVM
so it does a sti on behalf of the guest and then checks for pending
interrupts to inject.

> > Also SCHEDOP_block is not exactly the same on HVM, as it just checks
> > for pending vectors to inject, but not for pending event channels.
> 
> ... well, I don't want to use SCHEDOP_block anyway since that is not
> possible
> on ARM, and I would like to keep the code at least "portable with not too
> much effort". So there should be a non-racy "HVM way" to do this?

One solution (albeit it would be slightly crappy IMO) is to make sure
the timer is always fully han

Re: [PATCH v2 4/7] x86/vmx: add do_vmtrace_op

2020-06-22 Thread Jan Beulich
On 22.06.2020 18:02, Michał Leszczyński wrote:
> - 22 cze 2020 o 17:22, Jan Beulich jbeul...@suse.com napisał(a):
>> On 22.06.2020 16:35, Michał Leszczyński wrote:
>>> - 22 cze 2020 o 15:25, Jan Beulich jbeul...@suse.com napisał(a):
 Is any of what you do in this switch() actually legitimate without
 hvm_set_vmtrace_pt_size() having got called for the guest? From
 remarks elsewhere I imply you expect the param that you currently
 use to be set upon domain creation time, but at the very least the
 potentially big buffer should imo not get allocated up front, but
 only when tracing is to actually be enabled.
>>>
>>> Wait... so you want to allocate these buffers in runtime?
>>> Previously we were talking that there is too much runtime logic
>>> and these enable/disable hypercalls should be stripped to absolute
>>> minimum.
>>
>> Basic arrangements can be made at domain creation time. I don't
>> think though that it would be a good use of memory if you
>> allocated perhaps many gigabytes of memory just for possibly
>> wanting to enable tracing on a guest.
>>
> 
> From our previous conversations I thought that you want to have
> as much logic moved to the domain creation as possible.
> 
> Thus, a parameter "vmtrace_pt_size" was introduced. By default it's
> zero (= disabled), if you set it to a non-zero value, then trace buffers
> of given size will be allocated for the domain and you have possibility
> to use ipt_enable/ipt_disable at any moment.
> 
> This way the runtime logic is as thin as possible. I assume user knows
> in advance whether he/she would want to use external monitoring with IPT
> or not.

Andrew - I think you requested movement to domain_create(). Could
you clarify if indeed you mean to also allocate the big buffers
this early?

> It's also not "many gigabytes". In most use cases a buffer of 16/32/64 MB
> would suffice, I think.

But that's one such buffer per vCPU, isn't it? Plus these buffers
need to be physically contiguous, which is an additional possibly
severe constraint.

Jan



Re: Event delivery and "domain blocking" on PVHv2

2020-06-22 Thread Jan Beulich
On 22.06.2020 18:09, Roger Pau Monné wrote:
> On Mon, Jun 22, 2020 at 05:31:00PM +0200, Martin Lucina wrote:
>> On 2020-06-22 15:58, Roger Pau Monné wrote:
>>> On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:
 Aha! Thank you for pointing this out. I think you may be right, but
 this
 should be possible without doing the demuxing in interrupt context.
>>>
>>> If you don't do the demuxing in the interrupt context (ie: making the
>>> interrupt handler a noop), then you don't likely need such interrupt
>>> anyway?
>>
>> I need the/an interrupt to wake the VCPU from HLT state if we went to sleep.
>>
>>>
 How about this arrangement, which appears to work for me; no hangs I
 can see
 so far and domU survives ping -f fine with no packet loss:

 CAMLprim value
 mirage_xen_evtchn_block_domain(value v_deadline)
 {
 struct vcpu_info *vi = VCPU0_INFO();
 solo5_time_t deadline = Int64_val(v_deadline);

 if (solo5_clock_monotonic() < deadline) {
 __asm__ __volatile__ ("cli" : : : "memory");
 if (vi->evtchn_upcall_pending) {
 __asm__ __volatile__ ("sti");
 }
 else {
 hypercall_set_timer_op(deadline);
>>>
>>> What if you set a deadline so close that evtchn_upcall_pending gets
>>> set by Xen here and the interrupt is injected? You would execute the
>>> noop handler and just hlt, and could likely end up in the same blocked
>>> situation as before?
>>
>> Why would an interrupt be injected here? Doesn't the immediately preceding
>> "cli" disable that?
> 
> Well, I mean between the sti and the hlt instruction.

When EFLAGS.IF was clear before STI, then the first point at which
an interrupt can get injected is when HLT is already executed (i.e.
to wake from this HLT). That's the so called "STI shadow".

Jan
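[Schematically, the race-free idiom relying on the STI shadow looks like the
following guest-context pseudocode - not runnable host code, shown only to
summarize the sequence discussed above:]

```
cli                               ; close the event window
if (vi->evtchn_upcall_pending)    ; re-check with interrupts disabled
    sti                           ; work already pending: skip the hlt
else
    set_timer_op(deadline)        ; arm the wakeup timer
    sti                           ; STI shadow: no injection is possible...
    hlt                           ; ...until HLT, which the interrupt wakes
```

[Because the check happens under cli and the sti/hlt pair executes within
the one-instruction STI shadow, there is no window in which an event can be
delivered and then missed.]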



Re: [PATCH v2 4/7] x86/vmx: add do_vmtrace_op

2020-06-22 Thread Michał Leszczyński
- 22 cze 2020 o 18:16, Jan Beulich jbeul...@suse.com napisał(a):

> On 22.06.2020 18:02, Michał Leszczyński wrote:
>> - 22 cze 2020 o 17:22, Jan Beulich jbeul...@suse.com napisał(a):
>>> On 22.06.2020 16:35, Michał Leszczyński wrote:
 - 22 cze 2020 o 15:25, Jan Beulich jbeul...@suse.com napisał(a):
> Is any of what you do in this switch() actually legitimate without
> hvm_set_vmtrace_pt_size() having got called for the guest? From
> remarks elsewhere I imply you expect the param that you currently
> use to be set upon domain creation time, but at the very least the
> potentially big buffer should imo not get allocated up front, but
> only when tracing is to actually be enabled.

 Wait... so you want to allocate these buffers in runtime?
 Previously we were talking that there is too much runtime logic
 and these enable/disable hypercalls should be stripped to absolute
 minimum.
>>>
>>> Basic arrangements can be made at domain creation time. I don't
>>> think though that it would be a good use of memory if you
>>> allocated perhaps many gigabytes of memory just for possibly
>>> wanting to enable tracing on a guest.
>>>
>> 
>> From our previous conversations I thought that you want to have
>> as much logic moved to the domain creation as possible.
>> 
>> Thus, a parameter "vmtrace_pt_size" was introduced. By default it's
>> zero (= disabled), if you set it to a non-zero value, then trace buffers
>> of given size will be allocated for the domain and you have possibility
>> to use ipt_enable/ipt_disable at any moment.
>> 
>> This way the runtime logic is as thin as possible. I assume user knows
>> in advance whether he/she would want to use external monitoring with IPT
>> or not.
> 
> Andrew - I think you requested movement to domain_create(). Could
> you clarify if indeed you mean to also allocate the big buffers
> this early?
> 
>> It's also not "many gigabytes". In most use cases a buffer of 16/32/64 MB
>> would suffice, I think.
> 
> But that one such buffer per vCPU, isn't it? Plus these buffers
> need to be physically contiguous, which is an additional possibly
> severe constraint.

Yes. For my use case (VMI stuff) I estimate 16-64 MB per vCPU and for fuzzing
I think it would be even less.

And also yes - these buffers need to be physically contiguous and aligned
because otherwise the CPU would refuse to use them.


Best regards,
Michał Leszczyński
CERT Polska




Re: [PATCH v2 4/7] x86/vmx: add do_vmtrace_op

2020-06-22 Thread Roger Pau Monné
On Mon, Jun 22, 2020 at 06:16:57PM +0200, Jan Beulich wrote:
> On 22.06.2020 18:02, Michał Leszczyński wrote:
> > - 22 cze 2020 o 17:22, Jan Beulich jbeul...@suse.com napisał(a):
> >> On 22.06.2020 16:35, Michał Leszczyński wrote:
> >>> - 22 cze 2020 o 15:25, Jan Beulich jbeul...@suse.com napisał(a):
> > It's also not "many gigabytes". In most use cases a buffer of 16/32/64 MB
> > would suffice, I think.
> 
> But that one such buffer per vCPU, isn't it? Plus these buffers
> need to be physically contiguous, which is an additional possibly
> severe constraint.

FTR, from my reading of the SDM you can use a mode called ToPA where
the buffer is some kind of linked list of tables that map to output
regions. That would be nice, but IMO it should be implemented in a
next iteration after the basic support is in.

Roger.



Re: Event delivery and "domain blocking" on PVHv2

2020-06-22 Thread Martin Lucina

On 2020-06-22 18:20, Jan Beulich wrote:

On 22.06.2020 18:09, Roger Pau Monné wrote:

On Mon, Jun 22, 2020 at 05:31:00PM +0200, Martin Lucina wrote:

On 2020-06-22 15:58, Roger Pau Monné wrote:

On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:

Aha! Thank you for pointing this out. I think you may be right, but this
should be possible without doing the demuxing in interrupt context.


If you don't do the demuxing in the interrupt context (ie: making the
interrupt handler a noop), then you don't likely need such interrupt
anyway?


I need the/an interrupt to wake the VCPU from HLT state if we went to sleep.




How about this arrangement, which appears to work for me; no hangs I can see
so far and domU survives ping -f fine with no packet loss:

CAMLprim value
mirage_xen_evtchn_block_domain(value v_deadline)
{
struct vcpu_info *vi = VCPU0_INFO();
solo5_time_t deadline = Int64_val(v_deadline);

if (solo5_clock_monotonic() < deadline) {
__asm__ __volatile__ ("cli" : : : "memory");
if (vi->evtchn_upcall_pending) {
__asm__ __volatile__ ("sti");
}
else {
hypercall_set_timer_op(deadline);


What if you set a deadline so close that evtchn_upcall_pending gets
set by Xen here and the interrupt is injected? You would execute the
noop handler and just hlt, and could likely end up in the same blocked
situation as before?


Why would an interrupt be injected here? Doesn't the immediately preceding
"cli" disable that?


Well, I mean between the sti and the hlt instruction.


When EFLAGS.IF was clear before STI, then the first point at which
an interrupt can get injected is when HLT is already executed (i.e.
to wake from this HLT). That's the so called "STI shadow".


Indeed, that's what the Intel SDM says, and Andrew already mentioned
earlier in this thread in a different context, here:
https://lists.xenproject.org/archives/html/mirageos-devel/2020-06/msg00021.html

So it would seem that my latest approach is race-free?

Martin



Re: [PATCH v2 4/7] x86/vmx: add do_vmtrace_op

2020-06-22 Thread Michał Leszczyński
- 22 cze 2020 o 18:25, Roger Pau Monné roger@citrix.com napisał(a):

> On Mon, Jun 22, 2020 at 06:16:57PM +0200, Jan Beulich wrote:
>> On 22.06.2020 18:02, Michał Leszczyński wrote:
>> > - 22 cze 2020 o 17:22, Jan Beulich jbeul...@suse.com napisał(a):
>> >> On 22.06.2020 16:35, Michał Leszczyński wrote:
>> >>> - 22 cze 2020 o 15:25, Jan Beulich jbeul...@suse.com napisał(a):
>> > It's also not "many gigabytes". In most use cases a buffer of 16/32/64 MB
>> > would suffice, I think.
>> 
>> But that one such buffer per vCPU, isn't it? Plus these buffers
>> need to be physically contiguous, which is an additional possibly
>> severe constraint.
> 
> FTR, from my reading of the SDM you can use a mode called ToPA where
> the buffer is some kind of linked list of tables that map to output
> regions. That would be nice, but IMO it should be implemented in a
> next iteration after the basic support is in.
> 
> Roger.

Yes. I keep that in mind but right now I would like to go for the
minimum viable implementation, while ToPA could be added in the next
patch series.


Best regards,
Michał Leszczyński
CERT Polska



Re: Event delivery and "domain blocking" on PVHv2

2020-06-22 Thread Roger Pau Monné
On Mon, Jun 22, 2020 at 06:20:34PM +0200, Jan Beulich wrote:
> On 22.06.2020 18:09, Roger Pau Monné wrote:
> > On Mon, Jun 22, 2020 at 05:31:00PM +0200, Martin Lucina wrote:
> >> On 2020-06-22 15:58, Roger Pau Monné wrote:
> >>> On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:
>  Aha! Thank you for pointing this out. I think you may be right, but
>  this
>  should be possible without doing the demuxing in interrupt context.
> >>>
> >>> If you don't do the demuxing in the interrupt context (ie: making the
> >>> interrupt handler a noop), then you don't likely need such interrupt
> >>> anyway?
> >>
> >> I need the/an interrupt to wake the VCPU from HLT state if we went to 
> >> sleep.
> >>
> >>>
>  How about this arrangement, which appears to work for me; no hangs I
>  can see
>  so far and domU survives ping -f fine with no packet loss:
> 
>  CAMLprim value
>  mirage_xen_evtchn_block_domain(value v_deadline)
>  {
>  struct vcpu_info *vi = VCPU0_INFO();
>  solo5_time_t deadline = Int64_val(v_deadline);
> 
>  if (solo5_clock_monotonic() < deadline) {
>  __asm__ __volatile__ ("cli" : : : "memory");
>  if (vi->evtchn_upcall_pending) {
>  __asm__ __volatile__ ("sti");
>  }
>  else {
>  hypercall_set_timer_op(deadline);
> >>>
> >>> What if you set a deadline so close that evtchn_upcall_pending gets
> >>> set by Xen here and the interrupt is injected? You would execute the
> >>> noop handler and just hlt, and could likely end up in the same blocked
> >>> situation as before?
> >>
> >> Why would an interrupt be injected here? Doesn't the immediately preceding
> >> "cli" disable that?
> > 
> > Well, I mean between the sti and the hlt instruction.
> 
> When EFLAGS.IF was clear before STI, then the first point at which
> an interrupt can get injected is when HLT is already executed (i.e.
> to wake from this HLT). That's the so called "STI shadow".

Oh, so then what Martin does seems to be fine, as there's no race, by
the fact that evtchn_upcall_pending is checked with interrupts
disabled.

Thanks, Roger.
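To spell out the guarantee Jan describes: "sti" only re-enables interrupt
delivery after the *following* instruction has retired, so a "sti; hlt" pair
cannot be split by an interrupt. The race-free block sequence therefore looks
roughly like this (guest-kernel pseudocode based on Martin's snippet above;
it is privileged code and not runnable in a hosted environment):

```c
cli();                              /* IF=0: no event delivery from here on */
if ( vi->evtchn_upcall_pending )
    sti();                          /* work already pending: don't sleep */
else
{
    hypercall_set_timer_op(deadline);
    /*
     * "sti" opens the STI shadow: delivery is deferred until after the
     * "hlt" executes, so a wakeup event either arrives while halted
     * (and wakes us) or was already observed by the check above.
     */
    asm volatile ( "sti; hlt" : : : "memory" );
}
```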



[xen-4.11-testing test] 151279: regressions - FAIL

2020-06-22 Thread osstest service owner
flight 151279 xen-4.11-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/151279/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-amd64-prev  6 xen-build  fail in 151260 REGR. vs. 150040
 build-i386-prev   6 xen-build  fail in 151260 REGR. vs. 150040

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 10 debian-hvm-install fail pass 
in 151260
 test-armhf-armhf-xl-rtds 12 guest-startfail pass in 151260

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-migrupgrade  1 build-check(1)   blocked in 151260 n/a
 test-amd64-i386-migrupgrade   1 build-check(1)   blocked in 151260 n/a
 test-armhf-armhf-xl-rtds13 migrate-support-check fail in 151260 never pass
 test-armhf-armhf-xl-rtds 14 saverestore-support-check fail in 151260 never pass
 test-amd64-amd64-xl-qemuu-dmrestrict-amd64-dmrestrict 10 debian-hvm-install 
fail never pass
 test-amd64-i386-xl-qemuu-dmrestrict-amd64-dmrestrict 10 debian-hvm-install 
fail never pass
 test-amd64-i386-xl-pvshim12 guest-start  fail   never pass
 test-amd64-i386-libvirt-xsm  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  14 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt 13 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 13 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  13 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check 
fail never pass
 test-arm64-arm64-xl-xsm  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 14 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 13 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  14 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check 
fail never pass
 test-amd64-amd64-libvirt-vhd 12 migrate-support-checkfail   never pass
 test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop  fail never pass
 test-armhf-armhf-xl-multivcpu 13 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 14 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-cubietruck 13 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 14 saverestore-support-checkfail never pass
 test-armhf-armhf-xl-credit2  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  14 saverestore-support-checkfail   never pass
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop fail never pass
 test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop fail never pass
 test-amd64-amd64-xl-qemut-win7-amd64 17 guest-stop fail never pass
 test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop  fail never pass
 test-amd64-amd64-qemuu-nested-amd 17 debian-hvm-install/l1/l2  fail never pass
 test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop  fail never pass
 test-armhf-armhf-xl  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 12 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 13 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  12 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  13 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 13 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt 14 saverestore-support-checkfail   never pass
 test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop

Re: [PATCH v2 4/7] x86/vmx: add do_vmtrace_op

2020-06-22 Thread Michał Leszczyński
> +struct xen_hvm_vmtrace_op {
> +/* IN variable */
> +uint32_t version;   /* HVMOP_VMTRACE_INTERFACE_VERSION */
> +uint32_t cmd;
> +/* Enable/disable external vmtrace for given domain */
> +#define HVMOP_vmtrace_ipt_enable  1
> +#define HVMOP_vmtrace_ipt_disable 2
> +#define HVMOP_vmtrace_ipt_get_offset  3
> +domid_t domain;
> +uint32_t vcpu;
> +uint64_t size;
> +
> +/* OUT variable */
> +uint64_t offset;

 If this is to be a tools-only interface, please use uint64_aligned_t.

>>> 
>>> This type is not defined within hvm_op.h header. What should I do about it?
>> 
>> It gets defined by xen.h, so should be available here. Its
>> definitions live in a
>> 
>> #if defined(__XEN__) || defined(__XEN_TOOLS__)
>> 
>> section, which is what I did recommend to put your interface in
>> as well. Unless you want this to be exposed to the guest itself,
>> at which point further constraints would arise.
>> 

When I put it into #if defined(__XEN__) || defined(__XEN_TOOLS__),
it complains about the uint64_aligned_compat_t type missing.

I also can't spot any single instance of uint64_aligned_t within
this file.


ml



Re: [PATCH for-4.14] x86/msr: Disallow access to Processor Trace MSRs

2020-06-22 Thread Andrew Cooper
On 19/06/2020 14:39, Jan Beulich wrote:
> On 19.06.2020 15:28, Andrew Cooper wrote:
>> On 19/06/2020 13:48, Jan Beulich wrote:
>>> On 19.06.2020 13:58, Andrew Cooper wrote:
 --- a/xen/arch/x86/msr.c
 +++ b/xen/arch/x86/msr.c
 @@ -168,6 +168,12 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, 
 uint64_t *val)
  case MSR_TSX_FORCE_ABORT:
  case MSR_TSX_CTRL:
  case MSR_MCU_OPT_CTRL:
 +case MSR_RTIT_OUTPUT_BASE:
 +case MSR_RTIT_OUTPUT_MASK:
 +case MSR_RTIT_CTL:
 +case MSR_RTIT_STATUS:
 +case MSR_RTIT_CR3_MATCH:
 +case MSR_RTIT_ADDR_A(0) ... MSR_RTIT_ADDR_B(3):
>>> The respective CPUID field is 3 bits wide, so wouldn't it be better
>>> to cover the full possible range (0...6 afaict)?
>> Last time I tried, you objected to me covering MSR ranges which weren't
>> defined.
> Wasn't that for a range where some number had been re-used from
> earlier models (with entirely different purpose)?

I don't recall, but the answer isn't relevant to whether the MSRs at
those indices ought to be available to guests.

>> If you want to extend the range like that, it ought to be
>> MSR_RTIT_OUTPUT_BASE ... MSR_RTIT_ADDR_B(7) to cover the entire area
>> which seems to be exclusively for PT.
> I'd be okay with that, just that I'm not sure about (7): While
> SDM Vol 2 oddly enough doesn't seem to list anything for leaf 7
> subleaf 1 (or I'm sufficiently blind today), Vol 4 clearly says
> for n=0...3 "If CPUID.(EAX=07H,ECX=1):EAX[2:0] > ", and the
> field obviously can't be > 7.

7 gives the top of the bank of MSRs.  It isn't related to CPUID data.

~Andrew



Re: UEFI support in ARM DomUs

2020-06-22 Thread Julien Grall




On 22/06/2020 15:33, Oleksandr Andrushchenko wrote:


On 6/22/20 5:27 PM, Julien Grall wrote:

Hi Oleksandr,

On 22/06/2020 15:04, Oleksandr Andrushchenko wrote:

On 6/19/20 11:02 PM, Stefano Stabellini wrote:

On Thu, 18 Jun 2020, Julien Grall wrote:

ifeq ($(CONFIG_XEN),y)
arch-y += -D__XEN_INTERFACE_VERSION__=0x00040d00
endif

and we also have Xen 4.13 headers in the U-boot tree.


Sorry if this was already asked before. Why do you need to specify 
__XEN_INTERFACE_VERSION__?


This is because of include/xen/interface/xen-compat.h:

#if defined(__XEN__) || defined(__XEN_TOOLS__)

/* Xen is built with matching headers and implements the latest interface. */
#define __XEN_INTERFACE_VERSION__ __XEN_LATEST_INTERFACE_VERSION__
#elif !defined(__XEN_INTERFACE_VERSION__)
/* Guests which do not specify a version get the legacy interface. */
#define __XEN_INTERFACE_VERSION__ 0x
#endif


I am afraid this doesn't explain why you need to define it to a specific 
version.


__XEN_INTERFACE_VERSION__ is really mostly here to allow a guest to 
build if they rely on headers from xen.git (we may have done some 
renaming). Most (if not all) of the interfaces you care about ought to be stable.


Older Xen will return -ENOSYS/-EOPNOTSUPP for any values it doesn't know.


As you pull the headers in U-boot, you could safely define 
__XEN_INTERFACE_VERSION__ as __XEN_LATEST_INTERFACE_VERSION__. FWIW, 
this is what Linux is doing to some extend.


Note that I haven't suggested to keep __XEN_INTERFACE_VERSION__ as 
0x because it looks like it is going to do the wrong thing on 
arm32 :(.


I have spotted at least one issue with GNTTABOP_setup_table, where it will 
use unsigned long (i.e. 32-bit) for the older interface. But the hypervisor 
will always use 64-bit.


This is probably going to result in some overwrite. I think we should fix 
the headers to force the use of xen_pfn_t for all Xen on Arm versions.


Something like:

#if !(defined(__aarch64__) || defined(__arm__)) && \
    __XEN_INTERFACE_VERSION__ < 0x00040300

XEN_GUEST_HANDLE(ulong) frame_list;
#else
XEN_GUEST_HANDLE(xen_pfn_t) frame_list;
#endif



So, one needs to specify the version (QEMU does that via its configure script 
after some tests)


Well libxc is definitely not stable, hence why QEMU requires to specify 
the version. But the interface with the guest is always meant to be stable.






For the first part (__XEN_INTERFACE_VERSION__) I think we can provide it via
CFLAGS or something. This can also be done for the location of Xen headers.


__XEN_INTERFACE_VERSION__ should work through the CFLAGS. An alternative would 
be to allow the user to specify through the Kconfig.


You mean specifying via Kconfig something like "0x00040d00"?


Possibly yes.



And what about the headers? How will we provide their location if we decide not 
to include those in the tree?


I would do through Kconfig as well.

Cheers,

--
Julien Grall



[PATCH v3 0/7] Implement support for external IPT monitoring

2020-06-22 Thread Michał Leszczyński
Intel Processor Trace is an architectural extension available in modern Intel 
family CPUs. It allows recording a detailed trace of activity while the 
processor executes code. One might use the recorded trace to reconstruct the 
code flow, i.e. to find out the executed code paths, determine branches taken, 
and so forth.

The abovementioned feature is described in Intel(R) 64 and IA-32 Architectures 
Software Developer's Manual Volume 3C: System Programming Guide, Part 3, 
Chapter 36: "Intel Processor Trace."

This patch series implements an interface that Dom0 could use in order to 
enable IPT for particular vCPUs in DomU, allowing for external monitoring. Such 
a feature has numerous applications like malware monitoring, fuzzing, or 
performance testing.

Also thanks to Tamas K Lengyel for a few preliminary hints before
first version of this patch was submitted to xen-devel.

Changed since v1:
  * MSR_RTIT_CTL is managed using MSR load lists
  * other PT-related MSRs are modified only when vCPU goes out of context
  * trace buffer is now acquired as a resource
  * added vmtrace_pt_size parameter in xl.cfg; the size of the trace buffer
must be specified at the moment of domain creation
  * trace buffers are allocated on domain creation, destructed on
domain destruction
  * HVMOP_vmtrace_ipt_enable/disable is limited to enabling/disabling PT;
these calls don't manage buffer memory anymore
  * lifted 32 MFN/GFN array limit when acquiring resources
  * minor code style changes according to review

Changed since v2:
  * trace buffer is now allocated on domain creation (in v2 it was
allocated when hvm param was set)
  * restored 32-item limit in mfn/gfn arrays in acquire_resource
and instead implemented hypercall continuations
  * code changes according to Jan's and Roger's review

Michal Leszczynski (7):
  memory: batch processing in acquire_resource()
  x86/vmx: add Intel PT MSR definitions
  x86/vmx: add IPT cpu feature
  x86/vmx: add do_vmtrace_op
  tools/libxc: add xc_vmtrace_* functions
  tools/libxl: add vmtrace_pt_size parameter
  tools/proctrace: add proctrace tool

 docs/man/xl.cfg.5.pod.in|  10 +
 tools/golang/xenlight/helpers.gen.go|   2 +
 tools/golang/xenlight/types.gen.go  |   1 +
 tools/libxc/Makefile|   1 +
 tools/libxc/include/xenctrl.h   |  39 +++
 tools/libxc/xc_vmtrace.c|  94 ++
 tools/libxl/libxl_create.c  |   1 +
 tools/libxl/libxl_types.idl |   2 +
 tools/proctrace/COPYING | 339 
 tools/proctrace/Makefile|  50 +++
 tools/proctrace/proctrace.c | 158 +
 tools/xl/xl_parse.c |   4 +
 xen/arch/x86/hvm/hvm.c  | 168 ++
 xen/arch/x86/hvm/vmx/vmcs.c |   7 +-
 xen/arch/x86/hvm/vmx/vmx.c  |  31 ++
 xen/arch/x86/mm.c   |  28 ++
 xen/common/domain.c |   3 +
 xen/common/memory.c |  32 +-
 xen/include/asm-x86/cpufeature.h|   1 +
 xen/include/asm-x86/hvm/hvm.h   |   9 +
 xen/include/asm-x86/hvm/vmx/vmcs.h  |   4 +
 xen/include/asm-x86/hvm/vmx/vmx.h   |  14 +
 xen/include/asm-x86/msr-index.h |  37 +++
 xen/include/public/arch-x86/cpufeatureset.h |   1 +
 xen/include/public/domctl.h |   1 +
 xen/include/public/hvm/hvm_op.h |  26 ++
 xen/include/public/hvm/params.h |   2 +-
 xen/include/public/memory.h |   1 +
 xen/include/xen/sched.h |   4 +
 xen/include/xlat.lst|   1 +
 30 files changed, 1066 insertions(+), 5 deletions(-)
 create mode 100644 tools/libxc/xc_vmtrace.c
 create mode 100644 tools/proctrace/COPYING
 create mode 100644 tools/proctrace/Makefile
 create mode 100644 tools/proctrace/proctrace.c

-- 
2.20.1




[PATCH v3 1/7] memory: batch processing in acquire_resource()

2020-06-22 Thread Michał Leszczyński
Allow acquiring large resources by letting acquire_resource()
process items in batches, using hypercall continuations.

Signed-off-by: Michal Leszczynski 
---
 xen/common/memory.c | 32 +---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/xen/common/memory.c b/xen/common/memory.c
index 714077c1e5..3ab06581a2 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1046,10 +1046,12 @@ static int acquire_grant_table(struct domain *d, 
unsigned int id,
 }
 
 static int acquire_resource(
-XEN_GUEST_HANDLE_PARAM(xen_mem_acquire_resource_t) arg)
+XEN_GUEST_HANDLE_PARAM(xen_mem_acquire_resource_t) arg,
+unsigned long *start_extent)
 {
 struct domain *d, *currd = current->domain;
 xen_mem_acquire_resource_t xmar;
+uint32_t total_frames;
 /*
  * The mfn_list and gfn_list (below) arrays are ok on stack for the
  * moment since they are small, but if they need to grow in future
@@ -1077,8 +1079,17 @@ static int acquire_resource(
 return 0;
 }
 
+total_frames = xmar.nr_frames;
+
+if ( *start_extent )
+{
+xmar.frame += *start_extent;
+xmar.nr_frames -= *start_extent;
+guest_handle_add_offset(xmar.frame_list, *start_extent);
+}
+
 if ( xmar.nr_frames > ARRAY_SIZE(mfn_list) )
-return -E2BIG;
+xmar.nr_frames = ARRAY_SIZE(mfn_list);
 
 rc = rcu_lock_remote_domain_by_id(xmar.domid, &d);
 if ( rc )
@@ -1135,6 +1146,14 @@ static int acquire_resource(
 }
 }
 
+if ( !rc )
+{
+*start_extent += xmar.nr_frames;
+
+if ( *start_extent != total_frames )
+rc = -ERESTART;
+}
+
  out:
 rcu_unlock_domain(d);
 
@@ -1600,7 +1619,14 @@ long do_memory_op(unsigned long cmd, 
XEN_GUEST_HANDLE_PARAM(void) arg)
 
 case XENMEM_acquire_resource:
 rc = acquire_resource(
-guest_handle_cast(arg, xen_mem_acquire_resource_t));
+guest_handle_cast(arg, xen_mem_acquire_resource_t),
+&start_extent);
+
+if ( rc == -ERESTART )
+return hypercall_create_continuation(
+__HYPERVISOR_memory_op, "lh",
+op | (start_extent << MEMOP_EXTENT_SHIFT), arg);
+
 break;
 
 default:
-- 
2.20.1




[PATCH v3 3/7] x86/vmx: add IPT cpu feature

2020-06-22 Thread Michał Leszczyński
Check if the Intel Processor Trace feature is supported by the current
processor. Define the hvm_pt_supported() function.

Signed-off-by: Michal Leszczynski 
---
 xen/arch/x86/hvm/vmx/vmcs.c | 7 ++-
 xen/include/asm-x86/cpufeature.h| 1 +
 xen/include/asm-x86/hvm/hvm.h   | 9 +
 xen/include/asm-x86/hvm/vmx/vmcs.h  | 1 +
 xen/include/public/arch-x86/cpufeatureset.h | 1 +
 5 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index ca94c2bedc..8c78c906b2 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -291,6 +291,12 @@ static int vmx_init_vmcs_config(void)
 _vmx_cpu_based_exec_control &=
 ~(CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING);
 
+rdmsrl(MSR_IA32_VMX_MISC, _vmx_misc_cap);
+
+/* Check whether IPT is supported in VMX operation. */
+hvm_funcs.pt_supported = cpu_has_ipt &&
+ (_vmx_misc_cap & VMX_MISC_PT_SUPPORTED);
+
 if ( _vmx_cpu_based_exec_control & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS )
 {
 min = 0;
@@ -305,7 +311,6 @@ static int vmx_init_vmcs_config(void)
SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS |
SECONDARY_EXEC_XSAVES |
SECONDARY_EXEC_TSC_SCALING);
-rdmsrl(MSR_IA32_VMX_MISC, _vmx_misc_cap);
 if ( _vmx_misc_cap & VMX_MISC_VMWRITE_ALL )
 opt |= SECONDARY_EXEC_ENABLE_VMCS_SHADOWING;
 if ( opt_vpid_enabled )
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index f790d5c1f8..8d7955dd87 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -104,6 +104,7 @@
 #define cpu_has_clwb            boot_cpu_has(X86_FEATURE_CLWB)
 #define cpu_has_avx512er        boot_cpu_has(X86_FEATURE_AVX512ER)
 #define cpu_has_avx512cd        boot_cpu_has(X86_FEATURE_AVX512CD)
+#define cpu_has_ipt             boot_cpu_has(X86_FEATURE_IPT)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 1eb377dd82..8c0d0ece67 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -96,6 +96,9 @@ struct hvm_function_table {
 /* Necessary hardware support for alternate p2m's? */
 bool altp2m_supported;
 
+/* Hardware support for processor tracing? */
+bool pt_supported;
+
 /* Hardware virtual interrupt delivery enable? */
 bool virtual_intr_delivery_enabled;
 
@@ -630,6 +633,12 @@ static inline bool hvm_altp2m_supported(void)
 return hvm_funcs.altp2m_supported;
 }
 
+/* returns true if hardware supports Intel Processor Trace */
+static inline bool hvm_pt_supported(void)
+{
+return hvm_funcs.pt_supported;
+}
+
 /* updates the current hardware p2m */
 static inline void altp2m_vcpu_update_p2m(struct vcpu *v)
 {
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h 
b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 906810592f..0e9a0b8de6 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -283,6 +283,7 @@ extern u32 vmx_secondary_exec_control;
 #define VMX_VPID_INVVPID_SINGLE_CONTEXT_RETAINING_GLOBAL 0x800ULL
 extern u64 vmx_ept_vpid_cap;
 
+#define VMX_MISC_PT_SUPPORTED   0x4000
 #define VMX_MISC_CR3_TARGET 0x01ff
 #define VMX_MISC_VMWRITE_ALL0x2000
 
diff --git a/xen/include/public/arch-x86/cpufeatureset.h 
b/xen/include/public/arch-x86/cpufeatureset.h
index 5ca35d9d97..0d3f15f628 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -217,6 +217,7 @@ XEN_CPUFEATURE(SMAP,  5*32+20) /*S  Supervisor Mode 
Access Prevention */
 XEN_CPUFEATURE(AVX512_IFMA,   5*32+21) /*A  AVX-512 Integer Fused Multiply Add 
*/
 XEN_CPUFEATURE(CLFLUSHOPT,5*32+23) /*A  CLFLUSHOPT instruction */
 XEN_CPUFEATURE(CLWB,  5*32+24) /*A  CLWB instruction */
+XEN_CPUFEATURE(IPT,   5*32+25) /*   Intel Processor Trace */
 XEN_CPUFEATURE(AVX512PF,  5*32+26) /*A  AVX-512 Prefetch Instructions */
 XEN_CPUFEATURE(AVX512ER,  5*32+27) /*A  AVX-512 Exponent & Reciprocal 
Instrs */
 XEN_CPUFEATURE(AVX512CD,  5*32+28) /*A  AVX-512 Conflict Detection Instrs 
*/
-- 
2.20.1




[PATCH v3 2/7] x86/vmx: add Intel PT MSR definitions

2020-06-22 Thread Michał Leszczyński
Define constants related to Intel Processor Trace features.

Signed-off-by: Michal Leszczynski 
---
 xen/include/asm-x86/msr-index.h | 37 +
 1 file changed, 37 insertions(+)

diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index b328a47ed8..0203029be9 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -69,6 +69,43 @@
 #define MSR_MCU_OPT_CTRL0x0123
 #define  MCU_OPT_CTRL_RNGDS_MITG_DIS(_AC(1, ULL) <<  0)
 
+/* Intel PT MSRs */
+#define MSR_RTIT_OUTPUT_BASE            0x0560
+
+#define MSR_RTIT_OUTPUT_MASK            0x0561
+
+#define MSR_RTIT_CTL                    0x0570
+#define  RTIT_CTL_TRACEEN               (_AC(1, ULL) <<  0)
+#define  RTIT_CTL_CYCEN                 (_AC(1, ULL) <<  1)
+#define  RTIT_CTL_OS                    (_AC(1, ULL) <<  2)
+#define  RTIT_CTL_USR                   (_AC(1, ULL) <<  3)
+#define  RTIT_CTL_PWR_EVT_EN            (_AC(1, ULL) <<  4)
+#define  RTIT_CTL_FUP_ON_PTW            (_AC(1, ULL) <<  5)
+#define  RTIT_CTL_FABRIC_EN             (_AC(1, ULL) <<  6)
+#define  RTIT_CTL_CR3_FILTER            (_AC(1, ULL) <<  7)
+#define  RTIT_CTL_TOPA                  (_AC(1, ULL) <<  8)
+#define  RTIT_CTL_MTC_EN                (_AC(1, ULL) <<  9)
+#define  RTIT_CTL_TSC_EN                (_AC(1, ULL) << 10)
+#define  RTIT_CTL_DIS_RETC              (_AC(1, ULL) << 11)
+#define  RTIT_CTL_PTW_EN                (_AC(1, ULL) << 12)
+#define  RTIT_CTL_BRANCH_EN             (_AC(1, ULL) << 13)
+#define  RTIT_CTL_MTC_FREQ              (_AC(0x0F, ULL) << 14)
+#define  RTIT_CTL_CYC_THRESH            (_AC(0x0F, ULL) << 19)
+#define  RTIT_CTL_PSB_FREQ              (_AC(0x0F, ULL) << 24)
+#define  RTIT_CTL_ADDR(n)               (_AC(0x0F, ULL) << (32 + (4 * (n))))
+
+#define MSR_RTIT_STATUS 0x0571
+#define  RTIT_STATUS_FILTER_EN   (_AC(1, ULL) <<  0)
+#define  RTIT_STATUS_CONTEXT_EN  (_AC(1, ULL) <<  1)
+#define  RTIT_STATUS_TRIGGER_EN  (_AC(1, ULL) <<  2)
+#define  RTIT_STATUS_ERROR   (_AC(1, ULL) <<  4)
+#define  RTIT_STATUS_STOPPED (_AC(1, ULL) <<  5)
+#define  RTIT_STATUS_BYTECNT (_AC(0x1, ULL) <<  32)
+
+#define MSR_RTIT_CR3_MATCH  0x0572
+#define MSR_RTIT_ADDR_A(n)  (0x0580 + (n) * 2)
+#define MSR_RTIT_ADDR_B(n)  (0x0581 + (n) * 2)
+
 #define MSR_U_CET   0x06a0
 #define MSR_S_CET   0x06a2
 #define  CET_SHSTK_EN   (_AC(1, ULL) <<  0)
-- 
2.20.1




[PATCH v3 5/7] tools/libxc: add xc_vmtrace_* functions

2020-06-22 Thread Michał Leszczyński
Add functions in libxc that use the new HVMOP_vmtrace interface.

Signed-off-by: Michal Leszczynski 
---
 tools/libxc/Makefile  |  1 +
 tools/libxc/include/xenctrl.h | 39 +++
 tools/libxc/xc_vmtrace.c  | 94 +++
 3 files changed, 134 insertions(+)
 create mode 100644 tools/libxc/xc_vmtrace.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index fae5969a73..605e44501d 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -27,6 +27,7 @@ CTRL_SRCS-y   += xc_csched2.c
 CTRL_SRCS-y   += xc_arinc653.c
 CTRL_SRCS-y   += xc_rt.c
 CTRL_SRCS-y   += xc_tbuf.c
+CTRL_SRCS-y   += xc_vmtrace.c
 CTRL_SRCS-y   += xc_pm.c
 CTRL_SRCS-y   += xc_cpu_hotplug.c
 CTRL_SRCS-y   += xc_resume.c
diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 113ddd935d..66966f6c17 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1585,6 +1585,45 @@ int xc_tbuf_set_cpu_mask(xc_interface *xch, xc_cpumap_t 
mask);
 
 int xc_tbuf_set_evt_mask(xc_interface *xch, uint32_t mask);
 
+/**
+ * Enable processor trace for given vCPU in given DomU.
+ * Allocate the trace ringbuffer with given size.
+ *
+ * @parm xch a handle to an open hypervisor interface
+ * @parm domid domain identifier
+ * @parm vcpu vcpu identifier
+ * @return 0 on success, -1 on failure
+ */
+int xc_vmtrace_pt_enable(xc_interface *xch, uint32_t domid,
+ uint32_t vcpu);
+
+/**
+ * Disable processor trace for given vCPU in given DomU.
+ * Deallocate the trace ringbuffer.
+ *
+ * @parm xch a handle to an open hypervisor interface
+ * @parm domid domain identifier
+ * @parm vcpu vcpu identifier
+ * @return 0 on success, -1 on failure
+ */
+int xc_vmtrace_pt_disable(xc_interface *xch, uint32_t domid,
+  uint32_t vcpu);
+
+/**
+ * Get current offset inside the trace ringbuffer.
+ * This makes it possible to determine how much data has been written into
+ * the buffer. Once the buffer wraps around, the offset resets to 0 and the
+ * previous data will be overwritten.
+ *
+ * @parm xch a handle to an open hypervisor interface
+ * @parm domid domain identifier
+ * @parm vcpu vcpu identifier
+ * @parm offset current offset inside trace buffer will be written there
+ * @return 0 on success, -1 on failure
+ */
+int xc_vmtrace_pt_get_offset(xc_interface *xch, uint32_t domid,
+ uint32_t vcpu, uint64_t *offset);
+
 int xc_domctl(xc_interface *xch, struct xen_domctl *domctl);
 int xc_sysctl(xc_interface *xch, struct xen_sysctl *sysctl);
 
diff --git a/tools/libxc/xc_vmtrace.c b/tools/libxc/xc_vmtrace.c
new file mode 100644
index 00..79aad2d9a8
--- /dev/null
+++ b/tools/libxc/xc_vmtrace.c
@@ -0,0 +1,94 @@
+/**
+ * xc_vmtrace.c
+ *
+ * API for manipulating hardware tracing features
+ *
+ * Copyright (c) 2020, Michal Leszczynski
+ *
+ * Copyright 2020 CERT Polska. All rights reserved.
+ * Use is subject to license terms.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation;
+ * version 2.1 of the License.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; If not, see .
+ */
+
+#include "xc_private.h"
+#include 
+
+int xc_vmtrace_pt_enable(
+xc_interface *xch, uint32_t domid, uint32_t vcpu)
+{
+DECLARE_HYPERCALL_BUFFER(xen_hvm_vmtrace_op_t, arg);
+int rc = -1;
+
+arg = xc_hypercall_buffer_alloc(xch, arg, sizeof(*arg));
+if ( arg == NULL )
+return -1;
+
+arg->cmd = HVMOP_vmtrace_pt_enable;
+arg->domain = domid;
+arg->vcpu = vcpu;
+
+rc = xencall2(xch->xcall, __HYPERVISOR_hvm_op, HVMOP_vmtrace,
+  HYPERCALL_BUFFER_AS_ARG(arg));
+
+xc_hypercall_buffer_free(xch, arg);
+return rc;
+}
+
+int xc_vmtrace_pt_get_offset(
+xc_interface *xch, uint32_t domid, uint32_t vcpu, uint64_t *offset)
+{
+DECLARE_HYPERCALL_BUFFER(xen_hvm_vmtrace_op_t, arg);
+int rc = -1;
+
+arg = xc_hypercall_buffer_alloc(xch, arg, sizeof(*arg));
+if ( arg == NULL )
+return -1;
+
+arg->cmd = HVMOP_vmtrace_pt_get_offset;
+arg->domain = domid;
+arg->vcpu = vcpu;
+
+rc = xencall2(xch->xcall, __HYPERVISOR_hvm_op, HVMOP_vmtrace,
+  HYPERCALL_BUFFER_AS_ARG(arg));
+
+if ( rc == 0 )
+{
+*offset = arg->offset;
+}
+
+xc_hypercall_buffer_free(xch, arg);
+return rc;
+}
+
+int xc_vmtrace_pt_disable(xc_interface *xch,

[PATCH v3 4/7] x86/vmx: add do_vmtrace_op

2020-06-22 Thread Michał Leszczyński
Provide an interface for privileged domains to manage
external IPT monitoring. Guest IPT state will be preserved
across vmentry/vmexit using ipt_state structure.

Signed-off-by: Michal Leszczynski 
---
 xen/arch/x86/hvm/hvm.c | 168 +
 xen/arch/x86/hvm/vmx/vmx.c |  31 ++
 xen/arch/x86/mm.c  |  28 +
 xen/common/domain.c|   3 +
 xen/include/asm-x86/hvm/vmx/vmcs.h |   3 +
 xen/include/asm-x86/hvm/vmx/vmx.h  |  14 +++
 xen/include/public/domctl.h|   1 +
 xen/include/public/hvm/hvm_op.h|  26 +
 xen/include/public/hvm/params.h|   2 +-
 xen/include/public/memory.h|   1 +
 xen/include/xen/sched.h|   4 +
 xen/include/xlat.lst   |   1 +
 12 files changed, 281 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 5bb47583b3..5899df52c3 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -58,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -606,6 +607,57 @@ static int hvm_print_line(
 return X86EMUL_OKAY;
 }
 
+static int vmtrace_alloc_buffers(struct vcpu *v, uint64_t size)
+{
+struct page_info *pg;
+struct pt_state *pt;
+
+if ( size < PAGE_SIZE || size > GB(4) || (size & (size - 1)) )
+{
+/*
+ * We don't accept trace buffer size smaller than single page
+ * and the upper bound is defined as 4GB in the specification.
+ * The buffer size must be also a power of 2.
+ */
+return -EINVAL;
+}
+
+if ( vmx_add_host_load_msr(v, MSR_RTIT_CTL, 0) )
+return -EFAULT;
+
+pg = alloc_domheap_pages(v->domain, get_order_from_bytes(size),
+ MEMF_no_refcount);
+
+if ( !pg )
+return -ENOMEM;
+
+pt = xzalloc(struct pt_state);
+
+if ( !pt )
+return -ENOMEM;
+
+pt->output_base = page_to_maddr(pg);
+pt->output_mask.raw = size - 1;
+
+v->arch.hvm.vmx.pt_state = pt;
+
+return 0;
+}
+
+static void vmtrace_destroy_buffers(struct vcpu *v)
+{
+struct pt_state *pt = v->arch.hvm.vmx.pt_state;
+
+if ( pt )
+{
+free_domheap_pages(maddr_to_page(pt->output_base),
+   get_order_from_bytes(pt->output_mask.size + 1));
+
+xfree(pt);
+v->arch.hvm.vmx.pt_state = NULL;
+}
+}
+
 int hvm_domain_initialise(struct domain *d)
 {
 unsigned int nr_gsis;
@@ -747,7 +799,10 @@ void hvm_domain_relinquish_resources(struct domain *d)
 hpet_deinit(d);
 
 for_each_vcpu ( d, v )
+{
+vmtrace_destroy_buffers(v);
 hvmemul_cache_destroy(v);
+}
 }
 
 void hvm_domain_destroy(struct domain *d)
@@ -1594,6 +1649,13 @@ int hvm_vcpu_initialise(struct vcpu *v)
 hvm_set_guest_tsc(v, 0);
 }
 
+if ( d->vmtrace_pt_size )
+{
+rc = vmtrace_alloc_buffers(v, d->vmtrace_pt_size);
+if ( rc != 0 )
+goto fail1;
+}
+
 return 0;
 
  fail6:
@@ -4949,6 +5011,108 @@ static int compat_altp2m_op(
 return rc;
 }
 
+CHECK_hvm_vmtrace_op;
+
+static int do_vmtrace_op(XEN_GUEST_HANDLE_PARAM(void) arg)
+{
+struct xen_hvm_vmtrace_op a;
+struct domain *d;
+int rc;
+struct vcpu *v;
+struct pt_state *pt;
+
+if ( !hvm_pt_supported() )
+return -EOPNOTSUPP;
+
+if ( copy_from_guest(&a, arg, 1) )
+return -EFAULT;
+
+if ( a.pad1 || a.pad2 )
+return -EINVAL;
+
+rc = rcu_lock_live_remote_domain_by_id(a.domain, &d);
+
+if ( rc )
+goto out;
+
+if ( !is_hvm_domain(d) )
+{
+rc = -EOPNOTSUPP;
+goto out;
+}
+
+if ( a.vcpu >= d->max_vcpus )
+{
+rc = -EINVAL;
+goto out;
+}
+
+v = domain_vcpu(d, a.vcpu);
+pt = v->arch.hvm.vmx.pt_state;
+
+if ( !pt )
+{
+/* PT must be first initialized upon domain creation. */
+rc = -EINVAL;
+goto out;
+}
+
+switch ( a.cmd )
+{
+case HVMOP_vmtrace_pt_enable:
+vcpu_pause(v);
+spin_lock(&d->vmtrace_lock);
+if ( vmx_add_guest_msr(v, MSR_RTIT_CTL,
+   RTIT_CTL_TRACEEN | RTIT_CTL_OS |
+   RTIT_CTL_USR | RTIT_CTL_BRANCH_EN) )
+{
+rc = -EFAULT;
+goto out;
+}
+
+pt->active = 1;
+spin_unlock(&d->vmtrace_lock);
+vcpu_unpause(v);
+break;
+
+case HVMOP_vmtrace_pt_disable:
+vcpu_pause(v);
+spin_lock(&d->vmtrace_lock);
+
+if ( vmx_del_msr(v, MSR_RTIT_CTL, VMX_MSR_GUEST) )
+{
+rc = -EFAULT;
+goto out;
+}
+
+pt->active = 0;
+spin_unlock(&d->vmtrace_lock);
+vcpu_unpause(v);
+break;
+
+case HVMOP_vmtrace_pt_get_offset:
+a.offset = pt->output_mask.offset;
+
+if ( __copy_field_to_guest(guest_handle_cast(arg, 
xen_hvm_vmtrac

[PATCH v3 6/7] tools/libxl: add vmtrace_pt_size parameter

2020-06-22 Thread Michał Leszczyński
Allow specifying the size of the per-vCPU trace buffer upon
domain creation. This is zero by default (meaning: not enabled).

Signed-off-by: Michal Leszczynski 
---
 docs/man/xl.cfg.5.pod.in | 10 ++
 tools/golang/xenlight/helpers.gen.go |  2 ++
 tools/golang/xenlight/types.gen.go   |  1 +
 tools/libxl/libxl_create.c   |  1 +
 tools/libxl/libxl_types.idl  |  2 ++
 tools/xl/xl_parse.c  |  4 
 6 files changed, 20 insertions(+)

diff --git a/docs/man/xl.cfg.5.pod.in b/docs/man/xl.cfg.5.pod.in
index 0532739c1f..78f434b722 100644
--- a/docs/man/xl.cfg.5.pod.in
+++ b/docs/man/xl.cfg.5.pod.in
@@ -278,6 +278,16 @@ memory=8096 will report significantly less memory 
available for use
 than a system with maxmem=8096 memory=8096 due to the memory overhead
 of having to track the unused pages.
 
+=item B<vmtrace_pt_size=BYTES>
+
+Specifies the size of the processor trace buffer that will be allocated
+for each vCPU belonging to this domain. Disabled (i.e. B<0>)
+by default. This must be set to a non-zero value in order to be able to
+use processor tracing features with this domain.
+
+B<NOTE>: The size value must be between 4 kB and 4 GB, and it must
+also be a power of 2.
+
 =back
 
 =head3 Guest Virtual NUMA Configuration
diff --git a/tools/golang/xenlight/helpers.gen.go 
b/tools/golang/xenlight/helpers.gen.go
index 935d3bc50a..986ebbd681 100644
--- a/tools/golang/xenlight/helpers.gen.go
+++ b/tools/golang/xenlight/helpers.gen.go
@@ -1117,6 +1117,7 @@ return fmt.Errorf("invalid union key '%v'", x.Type)}
 x.ArchArm.GicVersion = GicVersion(xc.arch_arm.gic_version)
 x.ArchArm.Vuart = VuartType(xc.arch_arm.vuart)
 x.Altp2M = Altp2MMode(xc.altp2m)
+x.VmtracePtSize = int(xc.vmtrace_pt_size)
 
  return nil}
 
@@ -1592,6 +1593,7 @@ return fmt.Errorf("invalid union key '%v'", x.Type)}
 xc.arch_arm.gic_version = C.libxl_gic_version(x.ArchArm.GicVersion)
 xc.arch_arm.vuart = C.libxl_vuart_type(x.ArchArm.Vuart)
 xc.altp2m = C.libxl_altp2m_mode(x.Altp2M)
+xc.vmtrace_pt_size = C.int(x.VmtracePtSize)
 
  return nil
  }
diff --git a/tools/golang/xenlight/types.gen.go 
b/tools/golang/xenlight/types.gen.go
index 663c1e86b4..41ec7cdd32 100644
--- a/tools/golang/xenlight/types.gen.go
+++ b/tools/golang/xenlight/types.gen.go
@@ -516,6 +516,7 @@ GicVersion GicVersion
 Vuart VuartType
 }
 Altp2M Altp2MMode
+VmtracePtSize int
 }
 
 type domainBuildInfoTypeUnion interface {
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 75862dc6ed..32204b83b0 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -608,6 +608,7 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_config 
*d_config,
 .max_evtchn_port = b_info->event_channels,
 .max_grant_frames = b_info->max_grant_frames,
 .max_maptrack_frames = b_info->max_maptrack_frames,
+.vmtrace_pt_size = b_info->vmtrace_pt_size,
 };
 
 if (info->type != LIBXL_DOMAIN_TYPE_PV) {
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 9d3f05f399..04c1704b72 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -645,6 +645,8 @@ libxl_domain_build_info = Struct("domain_build_info",[
 # supported by x86 HVM and ARM support is planned.
 ("altp2m", libxl_altp2m_mode),
 
+("vmtrace_pt_size", integer),
+
 ], dir=DIR_IN,
copy_deprecated_fn="libxl__domain_build_info_copy_deprecated",
 )
diff --git a/tools/xl/xl_parse.c b/tools/xl/xl_parse.c
index 61b4ef7b7e..6ab98dda55 100644
--- a/tools/xl/xl_parse.c
+++ b/tools/xl/xl_parse.c
@@ -1861,6 +1861,10 @@ void parse_config_data(const char *config_source,
 }
 }
 
+if (!xlu_cfg_get_long(config, "vmtrace_pt_size", &l, 1)) {
+b_info->vmtrace_pt_size = l;
+}
+
 if (!xlu_cfg_get_list(config, "ioports", &ioports, &num_ioports, 0)) {
 b_info->num_ioports = num_ioports;
 b_info->ioports = calloc(num_ioports, sizeof(*b_info->ioports));
-- 
2.20.1




[PATCH v3 7/7] tools/proctrace: add proctrace tool

2020-06-22 Thread Michał Leszczyński
Add a demonstration tool that uses xc_vmtrace_* calls in order
to manage external IPT monitoring for DomU.

Signed-off-by: Michal Leszczynski 
---
 tools/proctrace/COPYING | 339 
 tools/proctrace/Makefile|  50 ++
 tools/proctrace/proctrace.c | 158 +
 3 files changed, 547 insertions(+)
 create mode 100644 tools/proctrace/COPYING
 create mode 100644 tools/proctrace/Makefile
 create mode 100644 tools/proctrace/proctrace.c

diff --git a/tools/proctrace/COPYING b/tools/proctrace/COPYING
new file mode 100644
index 00..c0a841112c
--- /dev/null
+++ b/tools/proctrace/COPYING
@@ -0,0 +1,339 @@
+   GNU GENERAL PUBLIC LICENSE
+  Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
+   59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+   Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it.  By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users.  This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it.  (Some other Free Software Foundation software is covered by
+the GNU Library General Public License instead.)  You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price.  Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+  To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+  For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have.  You must make sure that they, too, receive or can get the
+source code.  And you must show them these terms so they know their
+rights.
+
+  We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+  Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software.  If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+  Finally, any free program is threatened constantly by software
+patents.  We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary.  To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+   GNU GENERAL PUBLIC LICENSE
+   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+  0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License.  The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language.  (Hereinafter, translation is included without limitation in
+the term "modification".)  Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope.  The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+  1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and di

Re: [PATCH v3 0/7] Implement support for external IPT monitoring

2020-06-22 Thread Michał Leszczyński
- On 22 Jun 2020 at 20:06, Michał Leszczyński michal.leszczyn...@cert.pl 
wrote:

> This patch series implements an interface that Dom0 could use in order to
> enable IPT for particular vCPUs in DomU, allowing for external monitoring. 
> Such
> a feature has numerous applications like malware monitoring, fuzzing, or
> performance testing.


There is also a git branch with these patches:
https://github.com/icedevml/xen/tree/ipt-patch-v3b



Re: [RFC PATCH] xen/privcmd: Convert get_user_pages*() to pin_user_pages*()

2020-06-22 Thread Souptick Joarder
On Fri, Jun 19, 2020 at 1:00 PM John Hubbard  wrote:
>
> On 2020-06-18 20:12, Souptick Joarder wrote:
> > On Wed, Jun 17, 2020 at 11:29 PM Boris Ostrovsky
> >  wrote:
> >>
> >> On 6/16/20 11:14 PM, Souptick Joarder wrote:
> >>> In 2019, we introduced pin_user_pages*() and now we are converting
> >>> get_user_pages*() to the new API as appropriate. [1] & [2] could
> >>> be referred for more information.
>
>
> Ideally, the commit description should say which case, in
> pin_user_pages.rst, that this is.
>

Ok.

>
> >>>
> >>> [1] Documentation/core-api/pin_user_pages.rst
> >>>
> >>> [2] "Explicit pinning of user-space pages":
> >>>  https://lwn.net/Articles/807108/
> >>>
> >>> Signed-off-by: Souptick Joarder 
> >>> Cc: John Hubbard 
> >>> ---
> >>> Hi,
> >>>
> >>> I have compile tested this patch but unable to run-time test,
> >>> so any testing help is much appriciated.
> >>>
> >>> Also have a question, why the existing code is not marking the
> >>> pages dirty (since it did FOLL_WRITE) ?
> >>
> >>
> >> Indeed, seems to me it should. Paul?
>
> Definitely good to get an answer from an expert in this code, but
> meanwhile, it's reasonable to just mark them dirty. Below...
>
> >>
> >>
> >>>
> >>>   drivers/xen/privcmd.c | 7 ++-
> >>>   1 file changed, 2 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> >>> index a250d11..543739e 100644
> >>> --- a/drivers/xen/privcmd.c
> >>> +++ b/drivers/xen/privcmd.c
> >>> @@ -594,7 +594,7 @@ static int lock_pages(
> >>>if (requested > nr_pages)
> >>>return -ENOSPC;
> >>>
> >>> - pinned = get_user_pages_fast(
> >>> + pinned = pin_user_pages_fast(
> >>>(unsigned long) kbufs[i].uptr,
> >>>requested, FOLL_WRITE, pages);
> >>>if (pinned < 0)
> >>> @@ -614,10 +614,7 @@ static void unlock_pages(struct page *pages[], 
> >>> unsigned int nr_pages)
> >>>if (!pages)
> >>>return;
> >>>
> >>> - for (i = 0; i < nr_pages; i++) {
> >>> - if (pages[i])
> >>> - put_page(pages[i]);
> >>> - }
> >>> + unpin_user_pages(pages, nr_pages);
>
>
> ...so just use unpin_user_pages_dirty_lock() here, I think.
>
>
> >>
> >>
> >> Why are you no longer checking for valid pages?
> >
> > My understanding is, in case of lock_pages() end up returning partial
> > mapped pages,
> > we should pass no. of partial mapped pages to unlock_pages(), not nr_pages.
> > This will avoid checking extra check to validate the pages[i].
> >
> > and if lock_pages() returns 0 in success, anyway we have all the pages[i] 
> > valid.
> > I will try to correct it in v2.
> >
> > But I agree, there is no harm to check for pages[i] and I believe,
>
>
> Generally, it *is* harmful to do unnecessary checks, in most code, but 
> especially
> in most kernel code. If you can convince yourself that the check for null 
> pages
> is redundant here, then please let's remove that check. The code then
> becomes shorter, simpler, and faster.

I read the code again. I think this check is needed to handle the scenario
where lock_pages() returns -ENOSPC. Better to keep this check. Let me post
v2 of this RFC for a clearer view.

>
>
> > unpin_user_pages()
> > is the right place to do so.
> >
> > John any thought ?
>
>
> So far I haven't seen any cases to justify changing the implementation of
> unpin_user_pages().
>
>
> thanks,
> --
> John Hubbard
> NVIDIA



Re: [RFC PATCH] xen/privcmd: Convert get_user_pages*() to pin_user_pages*()

2020-06-22 Thread Boris Ostrovsky
On 6/22/20 2:52 PM, Souptick Joarder wrote:
>
> I read the code again. I think, this check is needed to handle a scenario when
> lock_pages() return -ENOSPC. Better to keep this check. Let me post v2 of this
> RFC for a clear view.


Actually, error handling seems to be somewhat broken here. If
lock_pages() returns number of pinned pages then that's what we end up
returning from privcmd_ioctl_dm_op(), all the way to user ioctl(). Which
I don't think is right, we should return proper (negative) error.


Do you mind fixing that as well? Then you should be able to avoid
testing pages in a loop.


-boris




Re: [RFC PATCH] xen/privcmd: Convert get_user_pages*() to pin_user_pages*()

2020-06-22 Thread Souptick Joarder
On Tue, Jun 23, 2020 at 12:40 AM Boris Ostrovsky
 wrote:
>
> On 6/22/20 2:52 PM, Souptick Joarder wrote:
> >
> > I read the code again. I think, this check is needed to handle a scenario 
> > when
> > lock_pages() return -ENOSPC. Better to keep this check. Let me post v2 of 
> > this
> > RFC for a clear view.
>
>
> Actually, error handling seems to be somewhat broken here. If
> lock_pages() returns number of pinned pages then that's what we end up
> returning from privcmd_ioctl_dm_op(), all the way to user ioctl(). Which
> I don't think is right, we should return proper (negative) error.
>

What -ERRNO is more appropriate here ? -ENOSPC ?

>
> Do you mind fixing that as well? Then you should be able to avoid
> testing pages in a loop.

Ok, let me try to fix it.
>
>
> -boris
>



Re: [RFC PATCH] xen/privcmd: Convert get_user_pages*() to pin_user_pages*()

2020-06-22 Thread Boris Ostrovsky
On 6/22/20 3:28 PM, Souptick Joarder wrote:
> On Tue, Jun 23, 2020 at 12:40 AM Boris Ostrovsky
>  wrote:
>> On 6/22/20 2:52 PM, Souptick Joarder wrote:
>>> I read the code again. I think, this check is needed to handle a scenario 
>>> when
>>> lock_pages() return -ENOSPC. Better to keep this check. Let me post v2 of 
>>> this
>>> RFC for a clear view.
>>
>> Actually, error handling seems to be somewhat broken here. If
>> lock_pages() returns number of pinned pages then that's what we end up
>> returning from privcmd_ioctl_dm_op(), all the way to user ioctl(). Which
>> I don't think is right, we should return proper (negative) error.
>>
> What -ERRNO is more appropriate here ? -ENOSPC ?


You can simply pass along error code that get_user_pages_fast() returned.


-boris




sysbus failed assert for xen_sysdev

2020-06-22 Thread Jason Andryuk
Hi,

Running qemu devel for a Xen VM is failing an assert after the recent
"qdev: Rework how we plug into the parent bus" sysbus changes.

qemu-system-i386: hw/core/qdev.c:102: qdev_set_parent_bus: Assertion
`dc->bus_type && object_dynamic_cast(OBJECT(bus), dc->bus_type)'
failed.

dc->bus_type is "xen-sysbus" and it's the
`object_dynamic_cast(OBJECT(bus), dc->bus_type)` portion that fails
the assert.  bus seems to be "main-system-bus", I think:
(gdb) p *bus
$3 = {obj = {class = 0x5636d780, free = 0x77c40db0 ,
properties = 0x563f7180, ref = 3, parent = 0x563fe980}, parent
= 0x0, name = 0x563fec60 "main-system-bus", ...

The call comes from hw/xen/xen-legacy-backend.c:706
sysbus_realize_and_unref(SYS_BUS_DEVICE(xen_sysdev), &error_fatal);

Any pointers on what needs to be fixed?

Thanks,
Jason



[qemu-mainline bisection] complete test-amd64-i386-freebsd10-amd64

2020-06-22 Thread osstest service owner
branch xen-unstable
xenbranch xen-unstable
job test-amd64-i386-freebsd10-amd64
testid guest-start

Tree: linux git://xenbits.xen.org/linux-pvops.git
Tree: linuxfirmware git://xenbits.xen.org/osstest/linux-firmware.git
Tree: ovmf git://xenbits.xen.org/osstest/ovmf.git
Tree: qemu git://xenbits.xen.org/qemu-xen-traditional.git
Tree: qemuu git://git.qemu.org/qemu.git
Tree: seabios git://xenbits.xen.org/osstest/seabios.git
Tree: xen git://xenbits.xen.org/xen.git

*** Found and reproduced problem changeset ***

  Bug is in tree:  qemuu git://git.qemu.org/qemu.git
  Bug introduced:  81cb05732efb36971901c515b007869cc1d3a532
  Bug not present: d6b78ac8ecf94f56dbfbecc23fb4365d8772a41a
  Last fail repro: http://logs.test-lab.xenproject.org/osstest/logs/151298/


  commit 81cb05732efb36971901c515b007869cc1d3a532
  Author: Markus Armbruster 
  Date:   Tue Jun 9 14:23:37 2020 +0200
  
  qdev: Assert devices are plugged into a bus that can take them
  
  This would have caught some of the bugs I just fixed.
  
  Signed-off-by: Markus Armbruster 
  Reviewed-by: Mark Cave-Ayland 
  Message-Id: <20200609122339.937862-23-arm...@redhat.com>


For bisection revision-tuple graph see:
   
http://logs.test-lab.xenproject.org/osstest/results/bisect/qemu-mainline/test-amd64-i386-freebsd10-amd64.guest-start.html
Revision IDs in each graph node refer, respectively, to the Trees above.


Running cs-bisection-step 
--graph-out=/home/logs/results/bisect/qemu-mainline/test-amd64-i386-freebsd10-amd64.guest-start
 --summary-out=tmp/151298.bisection-summary --basis-template=151065 
--blessings=real,real-bisect qemu-mainline test-amd64-i386-freebsd10-amd64 
guest-start
Searching for failure / basis pass:
 151269 fail [host=chardonnay1] / 151149 [host=pinot1] 151101 [host=fiano1] 
151065 [host=huxelrebe0] 151047 [host=fiano0] 150970 [host=pinot0] 150930 
[host=huxelrebe1] 150916 [host=elbling0] 150909 [host=huxelrebe0] 150895 
[host=debina1] 150831 [host=elbling1] 150694 [host=fiano0] 150631 ok.
Failure / basis pass flights: 151269 / 150631
(tree with no url: minios)
Tree: linux git://xenbits.xen.org/linux-pvops.git
Tree: linuxfirmware git://xenbits.xen.org/osstest/linux-firmware.git
Tree: ovmf git://xenbits.xen.org/osstest/ovmf.git
Tree: qemu git://xenbits.xen.org/qemu-xen-traditional.git
Tree: qemuu git://git.qemu.org/qemu.git
Tree: seabios git://xenbits.xen.org/osstest/seabios.git
Tree: xen git://xenbits.xen.org/xen.git
Latest c3038e718a19fc596f7b1baba0f83d5146dc7784 
c530a75c1e6a472b0eb9558310b518f0dfcd8860 
239b50a863704f7960525799eda82de061c7c458 
3c659044118e34603161457db9934a34f816d78b 
06c4cc3660b366278bdc7bc8b6677032d7b1118c 
2e3de6253422112ae43e608661ba94ea6b345694 
3625b04991b4d6affadd99d377ab84bac48dfff4
Basis pass c3038e718a19fc596f7b1baba0f83d5146dc7784 
c530a75c1e6a472b0eb9558310b518f0dfcd8860 
ca407c7246bf405da6d9b1b9d93e5e7f17b4b1f9 
3c659044118e34603161457db9934a34f816d78b 
5cc7a54c2e91d82cb6a52e4921325c511fd90712 
2e3de6253422112ae43e608661ba94ea6b345694 
1497e78068421d83956f8e82fb6e1bf1fc3b1199
Generating revisions with ./adhoc-revtuple-generator  
git://xenbits.xen.org/linux-pvops.git#c3038e718a19fc596f7b1baba0f83d5146dc7784-c3038e718a19fc596f7b1baba0f83d5146dc7784
 
git://xenbits.xen.org/osstest/linux-firmware.git#c530a75c1e6a472b0eb9558310b518f0dfcd8860-c530a75c1e6a472b0eb9558310b518f0dfcd8860
 
git://xenbits.xen.org/osstest/ovmf.git#ca407c7246bf405da6d9b1b9d93e5e7f17b4b1f9-239b50a863704f7960525799eda82de061c7c458
 git://xenbits.xen.org/qemu-xen-traditional.git#3c659044118e34603161457db9934a34f816d78b-3c659044118e34603161457db9934a34f816d78b 
git://git.qemu.org/qemu.git#5cc7a54c2e91d82cb6a52e4921325c511fd90712-06c4cc3660b366278bdc7bc8b6677032d7b1118c
 
git://xenbits.xen.org/osstest/seabios.git#2e3de6253422112ae43e608661ba94ea6b345694-2e3de6253422112ae43e608661ba94ea6b345694
 
git://xenbits.xen.org/xen.git#1497e78068421d83956f8e82fb6e1bf1fc3b1199-3625b04991b4d6affadd99d377ab84bac48dfff4
Loaded 84118 nodes in revision graph
Searching for test results:
 150631 pass c3038e718a19fc596f7b1baba0f83d5146dc7784 
c530a75c1e6a472b0eb9558310b518f0dfcd8860 
ca407c7246bf405da6d9b1b9d93e5e7f17b4b1f9 
3c659044118e34603161457db9934a34f816d78b 
5cc7a54c2e91d82cb6a52e4921325c511fd90712 
2e3de6253422112ae43e608661ba94ea6b345694 
1497e78068421d83956f8e82fb6e1bf1fc3b1199
 150694 [host=fiano0]
 150831 [host=elbling1]
 150909 [host=huxelrebe0]
 150930 [host=huxelrebe1]
 150916 [host=elbling0]
 150895 [host=debina1]
 150899 []
 150970 [host=pinot0]
 151047 [host=fiano0]
 151101 [host=fiano1]
 151065 [host=huxelrebe0]
 151149 [host=pinot1]
 151244 pass irrelevant
 151221 fail irrelevant
 151239 blocked irrelevant
 151253 pass c3038e718a19fc596f7b1baba0f83d5146dc7784 
c530a75c1e6a472b0eb9558310b518f0dfcd8860 
567bc4b4ae8a975791382dd30ac413bc0d3ce88c 
3c659044118e34603161457db9934a34f816d78b 
eea8f5df4ecc607d64f091b8d916fcc11a697541 
2e3de6253422112ae43e608661ba94ea6b345694 
2

Re: sysbus failed assert for xen_sysdev

2020-06-22 Thread Mark Cave-Ayland
On 22/06/2020 21:33, Jason Andryuk wrote:

> Hi,
> 
> Running qemu devel for a Xen VM is failing an assert after the recent
> "qdev: Rework how we plug into the parent bus" sysbus changes.
> 
> qemu-system-i386: hw/core/qdev.c:102: qdev_set_parent_bus: Assertion
> `dc->bus_type && object_dynamic_cast(OBJECT(bus), dc->bus_type)'
> failed.
> 
> dc->bus_type is "xen-sysbus" and it's the
> `object_dynamic_cast(OBJECT(bus), dc->bus_type)` portion that fails
> the assert.  bus seems to be "main-system-bus", I think:
> (gdb) p *bus
> $3 = {obj = {class = 0x5636d780, free = 0x77c40db0 ,
> properties = 0x563f7180, ref = 3, parent = 0x563fe980}, parent
> = 0x0, name = 0x563fec60 "main-system-bus", ...
> 
> The call comes from hw/xen/xen-legacy-backend.c:706
> sysbus_realize_and_unref(SYS_BUS_DEVICE(xen_sysdev), &error_fatal);
> 
> Any pointers on what needs to be fixed?

Hi Jason,

My understanding is that the assert() is telling you that you're plugging a
TYPE_SYS_BUS_DEVICE into a bus that isn't derived from TYPE_SYSTEM_BUS. A quick 
look
at the file in question suggests that you could try changing the parent class of
TYPE_XENSYSBUS from TYPE_BUS to TYPE_SYSTEM_BUS to see if that helps?


ATB,

Mark.



[linux-linus test] 151283: regressions - FAIL

2020-06-22 Thread osstest service owner
flight 151283 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/151283/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-xl-qemuu-ovmf-amd64  7 xen-boot  fail REGR. vs. 151214

Regressions which are regarded as allowable (not blocking):
 test-armhf-armhf-xl-rtds16 guest-start/debian.repeat fail REGR. vs. 151214

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemut-win7-amd64 17 guest-stopfail like 151214
 test-armhf-armhf-libvirt 14 saverestore-support-checkfail  like 151214
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stopfail like 151214
 test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stopfail like 151214
 test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop fail like 151214
 test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop fail like 151214
 test-armhf-armhf-libvirt-raw 13 saverestore-support-checkfail  like 151214
 test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stopfail like 151214
 test-amd64-i386-xl-pvshim12 guest-start  fail   never pass
 test-arm64-arm64-xl-seattle  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  14 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 13 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 13 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  13 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 13 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  14 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check 
fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check 
fail never pass
 test-amd64-amd64-libvirt-vhd 12 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 13 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 14 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-cubietruck 13 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 14 saverestore-support-checkfail never pass
 test-armhf-armhf-xl  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 14 saverestore-support-checkfail   never pass
 test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop  fail never pass
 test-armhf-armhf-xl-credit2  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  12 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  13 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 12 migrate-support-checkfail   never pass
 test-amd64-amd64-qemuu-nested-amd 17 debian-hvm-install/l1/l2  fail never pass
 test-amd64-i386-xl-qemut-ws16-amd64 17 guest-stop  fail never pass

version targeted for testing:
 linux    48778464bb7d346b47157d21ffde2af6b2d39110
baseline version:
 linux    1b5044021070efa3259f3e9548dc35d1eb6aa844

Last test of basis   151214  2020-06-18 02:27:46 Z4 days
Failing since151236  2020-06-19 19:10:35 Z3 days3 attempts
Testing same since   151283  2020-06-22 02:26:44 Z0 days1 attempts


Re: [PATCH v2 4/7] x86/vmx: add do_vmtrace_op

2020-06-22 Thread Michał Leszczyński
- On 22 Jun 2020 at 18:16, Jan Beulich jbeul...@suse.com wrote:

> On 22.06.2020 18:02, Michał Leszczyński wrote:
>> - On 22 Jun 2020 at 17:22, Jan Beulich jbeul...@suse.com wrote:
>>> On 22.06.2020 16:35, Michał Leszczyński wrote:
 - On 22 Jun 2020 at 15:25, Jan Beulich jbeul...@suse.com wrote:
> Is any of what you do in this switch() actually legitimate without
> hvm_set_vmtrace_pt_size() having got called for the guest? From
> remarks elsewhere I imply you expect the param that you currently
> use to be set upon domain creation time, but at the very least the
> potentially big buffer should imo not get allocated up front, but
> only when tracing is to actually be enabled.

 Wait... so you want to allocate these buffers at runtime?
 Previously we were talking that there is too much runtime logic
 and these enable/disable hypercalls should be stripped to absolute
 minimum.
>>>
>>> Basic arrangements can be made at domain creation time. I don't
>>> think though that it would be a good use of memory if you
>>> allocated perhaps many gigabytes of memory just for possibly
>>> wanting to enable tracing on a guest.
>>>
>> 
>> From our previous conversations I thought that you want to have
>> as much logic moved to the domain creation as possible.
>> 
>> Thus, a parameter "vmtrace_pt_size" was introduced. By default it's
>> zero (= disabled), if you set it to a non-zero value, then trace buffers
>> of given size will be allocated for the domain and you have possibility
>> to use ipt_enable/ipt_disable at any moment.
>> 
>> This way the runtime logic is as thin as possible. I assume user knows
>> in advance whether he/she would want to use external monitoring with IPT
>> or not.
> 
> Andrew - I think you requested movement to domain_create(). Could
> you clarify if indeed you mean to also allocate the big buffers
> this early?

I would like to recall what Andrew wrote a few days ago:

- On 16 Jun 2020 at 22:16, Andrew Cooper andrew.coop...@citrix.com wrote:
> Xen has traditionally opted for a "and turn this extra thing on
> dynamically" model, but this has caused no end of security issues and
> broken corner cases.
> 
> You can see this still existing in the difference between
> XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being
> required to chose the number of vcpus for the domain) and we're making
> good progress undoing this particular wart (before 4.13, it was
> concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by
> issuing other hypercalls between these two).
> 
> There is a lot of settings which should be immutable for the lifetime of
> the domain, and external monitoring looks like another one of these.
> Specifying it at createdomain time allows for far better runtime
> behaviour (you are no longer in a situation where the first time you try
> to turn tracing on, you end up with -ENOMEM because another VM booted in
> the meantime and used the remaining memory), and it makes for rather
> more simple code in Xen itself (at runtime, you can rely on it having
> been set up properly, because a failure setting up will have killed the
> domain already).
> 
> ...
> 
> ~Andrew

According to this quote, I've moved the buffer allocation to domain creation;
the updated version has already been sent to the list as patch v3.

Best regards,
Michał Leszczyński
CERT Polska



Re: [PATCH v2 2/2] optee: allow plain TMEM buffers with NULL address

2020-06-22 Thread Stefano Stabellini
On Fri, 19 Jun 2020, Volodymyr Babchuk wrote:
> Trusted Applications use a popular approach to determine the required size
> of buffer: client provides a memory reference with the NULL pointer to
> a buffer. This is so called "Null memory reference". TA updates the
> reference with the required size and returns it back to the
> client. Then client allocates buffer of needed size and repeats the
> operation.
> 
> This behavior is described in TEE Client API Specification, paragraph
> 3.2.5. Memory References.
> 
> OP-TEE represents this null memory reference as a TMEM parameter with
> buf_ptr = 0x0. This is the only case when we should allow TMEM
> buffer without the OPTEE_MSG_ATTR_NONCONTIG flag. This also the
> special case for a buffer with OPTEE_MSG_ATTR_NONCONTIG flag.
> 
> This could lead to a potential issue, because IPA 0x0 is a valid
> address, but OP-TEE will treat it as a special case. So, care should
> be taken when construction OP-TEE enabled guest to make sure that such
> guest have no memory at IPA 0x0 and none of its memory is mapped at PA
> 0x0.
> 
> Signed-off-by: Volodymyr Babchuk 
> ---
> 
> Changes from v1:
>  - Added comment with TODO about possible PA/IPA 0x0 issue
>  - The same is described in the commit message
>  - Added check in translate_noncontig() for the NULL ptr buffer
> 
> ---
>  xen/arch/arm/tee/optee.c | 27 ---
>  1 file changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/arch/arm/tee/optee.c b/xen/arch/arm/tee/optee.c
> index 6963238056..70bfef7e5f 100644
> --- a/xen/arch/arm/tee/optee.c
> +++ b/xen/arch/arm/tee/optee.c
> @@ -215,6 +215,15 @@ static bool optee_probe(void)
>  return true;
>  }
>  
> +/*
> + * TODO: There is a potential issue with guests that either have RAM
> + * at IPA of 0x0 or some of theirs memory is mapped at PA 0x0. This is
   ^ their

> + * because PA of 0x0 is considered as NULL pointer by OP-TEE. It will
> + * not be able to map buffer with such pointer to TA address space, or
> + * use such buffer for communication with the guest. We either need to
> + * check that guest have no such mappings or ensure that OP-TEE
> + * enabled guest will not be created with such mappings.
> + */
>  static int optee_domain_init(struct domain *d)
>  {
>  struct arm_smccc_res resp;
> @@ -725,6 +734,15 @@ static int translate_noncontig(struct optee_domain *ctx,
>  uint64_t next_page_data;
>  } *guest_data, *xen_data;
>  
> +/*
> + * Special case: buffer with buf_ptr == 0x0 is considered as NULL
> + * pointer by OP-TEE. No translation is needed. This can lead to
> + * an issue as IPA 0x0 is a valid address for Xen. See the comment
> + * near optee_domain_init()
> + */
> +if ( !param->u.tmem.buf_ptr )
> +return 0;

Given that today it is not possible for this to happen, it could even be
an ASSERT. But I think I would just return an error, maybe -EINVAL?

Aside from this, and the small grammar issue, everything else looks fine
to me.

Let's wait for Julien's reply, but if this is the only thing, I could fix
it on commit.


>  /* Offset of user buffer withing OPTEE_MSG_NONCONTIG_PAGE_SIZE-sized 
> page */
>  offset = param->u.tmem.buf_ptr & (OPTEE_MSG_NONCONTIG_PAGE_SIZE - 1);
>  
> @@ -865,9 +883,12 @@ static int translate_params(struct optee_domain *ctx,
>  }
>  else
>  {
> -gdprintk(XENLOG_WARNING, "Guest tries to use old tmem 
> arg\n");
> -ret = -EINVAL;
> -goto out;
> +if ( call->xen_arg->params[i].u.tmem.buf_ptr )
> +{
> +gdprintk(XENLOG_WARNING, "Guest tries to use old tmem 
> arg\n");
> +ret = -EINVAL;
> +goto out;
> +}
>  }
>  break;
>  case OPTEE_MSG_ATTR_TYPE_NONE:
> -- 
> 2.26.2
> 



Re: [PATCH v2 1/2] optee: immediately free buffers that are released by OP-TEE

2020-06-22 Thread Stefano Stabellini
On Fri, 19 Jun 2020, Volodymyr Babchuk wrote:
> Normal World can share buffer with OP-TEE for two reasons:
> 1. Some client application wants to exchange data with TA
> 2. OP-TEE asks for shared buffer for internal needs
> 
> The second case was handle more strictly than necessary:
> 
> 1. In RPC request OP-TEE asks for buffer
> 2. NW allocates buffer and provides it via RPC response
> 3. Xen pins pages and translates data
> 4. Xen provides buffer to OP-TEE
> 5. OP-TEE uses it
> 6. OP-TEE sends request to free the buffer
> 7. NW frees the buffer and sends the RPC response
> 8. Xen unpins pages and forgets about the buffer
> 
> The problem is that Xen should forget about buffer in between stages 6
> and 7. I.e. the right flow should be like this:
> 
> 6. OP-TEE sends request to free the buffer
> 7. Xen unpins pages and forgets about the buffer
> 8. NW frees the buffer and sends the RPC response
> 
> This is because OP-TEE internally frees the buffer before sending the
> "free SHM buffer" request. So we have no reason to hold reference for
> this buffer anymore. Moreover, in multiprocessor systems NW have time
> to reuse buffer cookie for another buffer. Xen complained about this
> and denied the new buffer registration. I have seen this issue while
> running tests on iMX SoC.
> 
> So, this patch basically corrects that behavior by freeing the buffer
> earlier, when handling RPC return from OP-TEE.
> 
> Signed-off-by: Volodymyr Babchuk 

There are a couple of grammar issues in the comments, but we can fix
them on commit.

Acked-by: Stefano Stabellini 



> ---
> 
> Changes from v1:
>  - reworded the comments
>  - added WARN() for a case when OP-TEE wants to release not the
>buffer it requeset to allocate durint this call
> 
> ---
>  xen/arch/arm/tee/optee.c | 32 
>  1 file changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/arm/tee/optee.c b/xen/arch/arm/tee/optee.c
> index 6a035355db..6963238056 100644
> --- a/xen/arch/arm/tee/optee.c
> +++ b/xen/arch/arm/tee/optee.c
> @@ -1099,6 +1099,34 @@ static int handle_rpc_return(struct optee_domain *ctx,
>  if ( shm_rpc->xen_arg->cmd == OPTEE_RPC_CMD_SHM_ALLOC )
>  call->rpc_buffer_type = shm_rpc->xen_arg->params[0].u.value.a;
>  
> +/*
> + * OP-TEE is signalling that it has freed the buffer that it
> + * requested before. This is the right time for us to do the
> + * same.
> + */
> +if ( shm_rpc->xen_arg->cmd == OPTEE_RPC_CMD_SHM_FREE )
> +{
> +uint64_t cookie = shm_rpc->xen_arg->params[0].u.value.b;
> +
> +free_optee_shm_buf(ctx, cookie);
> +
> +/*
> + * OP-TEE asks to free buffer, but this is not the same
> + * buffer we previously allocated for it. While nothing
> + * prevents OP-TEE from asking this, it is the strange
  ^ a

> + * situation. This may or may not be caused by a bug in
> + * OP-TEE or mediator. But is better to print warning.
  ^ it is

> + */
> +if ( call->rpc_data_cookie && call->rpc_data_cookie != cookie )
> +{
> +gprintk(XENLOG_ERR,
> +"Saved RPC cookie does not corresponds to OP-TEE's 
> (%"PRIx64" != %"PRIx64")\n",
  ^ correspond


> +call->rpc_data_cookie, cookie);
> +
> +WARN();
> +}
> +call->rpc_data_cookie = 0;
> +}
>  unmap_domain_page(shm_rpc->xen_arg);
>  }
>  
> @@ -1464,10 +1492,6 @@ static void handle_rpc_cmd(struct optee_domain *ctx, 
> struct cpu_user_regs *regs,
>  }
>  break;
>  case OPTEE_RPC_CMD_SHM_FREE:
> -free_optee_shm_buf(ctx, shm_rpc->xen_arg->params[0].u.value.b);
> -if ( call->rpc_data_cookie ==
> - shm_rpc->xen_arg->params[0].u.value.b )
> -call->rpc_data_cookie = 0;
>  break;
>  default:
>  break;
> -- 
> 2.26.2
> 



Re: UEFI support in ARM DomUs

2020-06-22 Thread Stefano Stabellini
On Mon, 22 Jun 2020, Julien Grall wrote:
> > > > For the first part (__XEN_INTERFACE_VERSION__) I think we can provide it
> > > > via
> > > > 
> > > > CFLAGS or something. This can also be done for the location of Xen
> > > > headers.
> > > 
> > > __XEN_INTERFACE_VERSION__ should work through the CFLAGS. An alternative
> > > would be to allow the user to specify through the Kconfig.
> > 
> > You mean specifying via Kconfig something like "0x00040d00"?
> 
> Possibly yes.
> 
> > 
> > And what about the headers? How will we provide their location if we decide
> > not to include those
> > 
> > in the tree?
> 
> I would do through Kconfig as well.

If we specify the external location of the Xen headers via Kconfig, it
seems to me that we should be able to detect the interface version
automatically from any Makefile as part of the build. No need to ask the
user.

However, if Oleksandr is thinking of using the Xen headers for the
hypercalls definitions, then I think we might not need external headers
at all because that is a stable interface, as Julien wrote. We could
just define our own few headers for just what we need, like Linux does.

If you can do that, I think it would be better because we decouple the
UBoot build from the Xen build completely. We don't even need the Xen
tree checked out to build UBoot. It would be a huge advantage because it
makes it far easier to build-test changes for others in the community
that don't know about Xen and also it becomes far easier to integrate
into any build system.



[qemu-mainline test] 151286: regressions - FAIL

2020-06-22 Thread osstest service owner
flight 151286 qemu-mainline real [real]
http://logs.test-lab.xenproject.org/osstest/logs/151286/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-qemuu-rhel6hvm-intel 10 redhat-install   fail REGR. vs. 151065
 test-amd64-i386-libvirt-xsm  12 guest-start  fail REGR. vs. 151065
 test-amd64-i386-freebsd10-amd64 11 guest-start   fail REGR. vs. 151065
 test-amd64-i386-libvirt-pair 21 guest-start/debian   fail REGR. vs. 151065
 test-amd64-i386-qemuu-rhel6hvm-amd 10 redhat-install fail REGR. vs. 151065
 test-amd64-amd64-libvirt 12 guest-start  fail REGR. vs. 151065
 test-amd64-amd64-libvirt-xsm 12 guest-start  fail REGR. vs. 151065
 test-amd64-i386-libvirt  12 guest-start  fail REGR. vs. 151065
 test-amd64-i386-freebsd10-i386 11 guest-startfail REGR. vs. 151065
 test-amd64-amd64-libvirt-pair 21 guest-start/debian  fail REGR. vs. 151065
 test-amd64-amd64-qemuu-nested-amd 10 debian-hvm-install  fail REGR. vs. 151065
 test-amd64-amd64-xl-qemuu-dmrestrict-amd64-dmrestrict 10 debian-hvm-install 
fail REGR. vs. 151065
 test-amd64-i386-xl-qemuu-dmrestrict-amd64-dmrestrict 10 debian-hvm-install 
fail REGR. vs. 151065
 test-amd64-amd64-qemuu-nested-intel 10 debian-hvm-install fail REGR. vs. 151065
 test-arm64-arm64-libvirt-xsm 12 guest-start  fail REGR. vs. 151065
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-shadow 10 debian-hvm-install fail 
REGR. vs. 151065
 test-amd64-i386-xl-qemuu-debianhvm-amd64 10 debian-hvm-install fail REGR. vs. 
151065
 test-amd64-amd64-xl-qemuu-debianhvm-amd64 10 debian-hvm-install fail REGR. vs. 
151065
 test-amd64-i386-xl-qemuu-debianhvm-amd64-shadow 10 debian-hvm-install fail 
REGR. vs. 151065
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 10 debian-hvm-install fail 
REGR. vs. 151065
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 10 debian-hvm-install fail 
REGR. vs. 151065
 test-amd64-i386-xl-qemuu-ovmf-amd64 10 debian-hvm-install fail REGR. vs. 151065
 test-amd64-amd64-xl-qemuu-ovmf-amd64 10 debian-hvm-install fail REGR. vs. 
151065
 test-amd64-amd64-xl-qcow210 debian-di-installfail REGR. vs. 151065
 test-amd64-amd64-libvirt-vhd 10 debian-di-installfail REGR. vs. 151065
 test-amd64-i386-xl-qemuu-debianhvm-i386-xsm 10 debian-hvm-install fail REGR. 
vs. 151065
 test-amd64-amd64-xl-qemuu-debianhvm-i386-xsm 10 debian-hvm-install fail REGR. 
vs. 151065
 test-armhf-armhf-xl-vhd  10 debian-di-installfail REGR. vs. 151065
 test-armhf-armhf-libvirt-raw 10 debian-di-installfail REGR. vs. 151065
 test-armhf-armhf-libvirt 12 guest-start  fail REGR. vs. 151065
 test-amd64-amd64-xl-qemuu-win7-amd64 10 windows-install  fail REGR. vs. 151065
 test-amd64-i386-xl-qemuu-win7-amd64 10 windows-install   fail REGR. vs. 151065
 test-amd64-amd64-xl-qemuu-ws16-amd64 10 windows-install  fail REGR. vs. 151065
 test-amd64-i386-xl-qemuu-ws16-amd64 10 windows-install   fail REGR. vs. 151065

Tests which did not succeed, but are not blocking:
 test-amd64-i386-xl-pvshim12 guest-start  fail   never pass
 test-arm64-arm64-xl-seattle  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 13 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 14 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-credit2  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 13 migrate-s

Re: [PATCH v2 2/2] optee: allow plain TMEM buffers with NULL address

2020-06-22 Thread Volodymyr Babchuk


Hi Stefano,

Stefano Stabellini writes:

> On Fri, 19 Jun 2020, Volodymyr Babchuk wrote:
>> Trusted Applications use popular approach to determine required size
>> of buffer: client provides a memory reference with the NULL pointer to
>> a buffer. This is so called "Null memory reference". TA updates the
>> reference with the required size and returns it back to the
>> client. Then client allocates buffer of needed size and repeats the
>> operation.
>> 
>> This behavior is described in TEE Client API Specification, paragraph
>> 3.2.5. Memory References.
>> 
>> OP-TEE represents this null memory reference as a TMEM parameter with
>> buf_ptr = 0x0. This is the only case when we should allow TMEM
>> buffer without the OPTEE_MSG_ATTR_NONCONTIG flag. This also the
>> special case for a buffer with OPTEE_MSG_ATTR_NONCONTIG flag.
>> 
>> This could lead to a potential issue, because IPA 0x0 is a valid
>> address, but OP-TEE will treat it as a special case. So, care should
>> be taken when construction OP-TEE enabled guest to make sure that such
>> guest have no memory at IPA 0x0 and none of its memory is mapped at PA
>> 0x0.
>> 
>> Signed-off-by: Volodymyr Babchuk 
>> ---
>> 
>> Changes from v1:
>>  - Added comment with TODO about possible PA/IPA 0x0 issue
>>  - The same is described in the commit message
>>  - Added check in translate_noncontig() for the NULL ptr buffer
>> 
>> ---
>>  xen/arch/arm/tee/optee.c | 27 ---
>>  1 file changed, 24 insertions(+), 3 deletions(-)
>> 
>> diff --git a/xen/arch/arm/tee/optee.c b/xen/arch/arm/tee/optee.c
>> index 6963238056..70bfef7e5f 100644
>> --- a/xen/arch/arm/tee/optee.c
>> +++ b/xen/arch/arm/tee/optee.c
>> @@ -215,6 +215,15 @@ static bool optee_probe(void)
>>  return true;
>>  }
>>  
>> +/*
>> + * TODO: There is a potential issue with guests that either have RAM
>> + * at IPA of 0x0 or some of theirs memory is mapped at PA 0x0. This is
>^ their
>
>> + * because PA of 0x0 is considered as NULL pointer by OP-TEE. It will
>> + * not be able to map buffer with such pointer to TA address space, or
>> + * use such buffer for communication with the guest. We either need to
>> + * check that guest have no such mappings or ensure that OP-TEE
>> + * enabled guest will not be created with such mappings.
>> + */
>>  static int optee_domain_init(struct domain *d)
>>  {
>>  struct arm_smccc_res resp;
>> @@ -725,6 +734,15 @@ static int translate_noncontig(struct optee_domain *ctx,
>>  uint64_t next_page_data;
>>  } *guest_data, *xen_data;
>>  
>> +/*
>> + * Special case: buffer with buf_ptr == 0x0 is considered as NULL
>> + * pointer by OP-TEE. No translation is needed. This can lead to
>> + * an issue as IPA 0x0 is a valid address for Xen. See the comment
>> + * near optee_domain_init()
>> + */
>> +if ( !param->u.tmem.buf_ptr )
>> +return 0;
>
> Given that today it is not possible for this to happen, it could even be
> an ASSERT. But I think I would just return an error, maybe -EINVAL?

Hmm, looks like my comment is somewhat misleading :(

What I mean is that param->u.tmem.buf_ptr == 0 is the normal situation.
This is the special case when OP-TEE treats this buffer as a NULL
pointer. So we are doing nothing there. Thus, "return 0".

But, as Julien pointed out, we can have a machine where 0x0 is a valid
memory address, and there is a chance that some guest will use it as a
pointer to a buffer.

> Aside from this, and the small grammar issue, everything else looks fine
> to me.
>
> Let's wait for Julien's reply, but if this is the only thing I could fix
> on commit.
>
>
>>  /* Offset of user buffer withing OPTEE_MSG_NONCONTIG_PAGE_SIZE-sized 
>> page */
>>  offset = param->u.tmem.buf_ptr & (OPTEE_MSG_NONCONTIG_PAGE_SIZE - 1);
>>  
>> @@ -865,9 +883,12 @@ static int translate_params(struct optee_domain *ctx,
>>  }
>>  else
>>  {
>> -gdprintk(XENLOG_WARNING, "Guest tries to use old tmem 
>> arg\n");
>> -ret = -EINVAL;
>> -goto out;
>> +if ( call->xen_arg->params[i].u.tmem.buf_ptr )
>> +{
>> +gdprintk(XENLOG_WARNING, "Guest tries to use old tmem 
>> arg\n");
>> +ret = -EINVAL;
>> +goto out;
>> +}
>>  }
>>  break;
>>  case OPTEE_MSG_ATTR_TYPE_NONE:
>> -- 
>> 2.26.2
>> 


-- 
Volodymyr Babchuk at EPAM


Re: [PATCH v3 7/7] tools/proctrace: add proctrace tool

2020-06-22 Thread Tamas K Lengyel
> +rc = xenforeignmemory_unmap_resource(fmem, fres);
> +
> +if (rc) {
> +fprintf(stderr, "Failed to unmap resource\n");
> +return 1;
> +}
> +
> +rc = xc_vmtrace_pt_disable(xc, domid, vcpu_id);
> +
> +if (rc) {
> +fprintf(stderr, "Failed to call xc_vmtrace_pt_disable\n");
> +return 1;
> +}

Looks like you are missing an xc_interface_close here.

> +
> +return 0;
> +}
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> --
> 2.20.1
>
>



Re: sysbus failed assert for xen_sysdev

2020-06-22 Thread Jason Andryuk
On Mon, Jun 22, 2020 at 5:17 PM Mark Cave-Ayland
 wrote:
>
> On 22/06/2020 21:33, Jason Andryuk wrote:
>
> > Hi,
> >
> > Running qemu devel for a Xen VM is failing an assert after the recent
> > "qdev: Rework how we plug into the parent bus" sysbus changes.
> >
> > qemu-system-i386: hw/core/qdev.c:102: qdev_set_parent_bus: Assertion
> > `dc->bus_type && object_dynamic_cast(OBJECT(bus), dc->bus_type)'
> > failed.
> >
> > dc->bus_type is "xen-sysbus" and it's the
> > `object_dynamic_cast(OBJECT(bus), dc->bus_type)` portion that fails
> > the assert.  bus seems to be "main-system-bus", I think:
> > (gdb) p *bus
> > $3 = {obj = {class = 0x5636d780, free = 0x77c40db0 ,
> > properties = 0x563f7180, ref = 3, parent = 0x563fe980}, parent
> > = 0x0, name = 0x563fec60 "main-system-bus", ...
> >
> > The call comes from hw/xen/xen-legacy-backend.c:706
> > sysbus_realize_and_unref(SYS_BUS_DEVICE(xen_sysdev), &error_fatal);
> >
> > Any pointers on what needs to be fixed?
>
> Hi Jason,
>
> My understanding is that the assert() is telling you that you're plugging a
> TYPE_SYS_BUS_DEVICE into a bus that isn't derived from TYPE_SYSTEM_BUS. A 
> quick look
> at the file in question suggests that you could try changing the parent class 
> of
> TYPE_XENSYSBUS from TYPE_BUS to TYPE_SYSTEM_BUS to see if that helps?

Hi, Mark.

Thanks, but unfortunately changing xensysbus_info.parent does not
stop the assert.  But it kinda pointed me in the right direction.

xen-sysdev overrode the bus_type which was breaking sysbus_realize.
So drop that:

--- a/hw/xen/xen-legacy-backend.c
+++ b/hw/xen/xen-legacy-backend.c
@@ -824,7 +825,7 @@ static void xen_sysdev_class_init(ObjectClass
*klass, void *data)
 DeviceClass *dc = DEVICE_CLASS(klass);

 device_class_set_props(dc, xen_sysdev_properties);
-dc->bus_type = TYPE_XENSYSBUS;
+//dc->bus_type = TYPE_XENSYSBUS;
 }

 static const TypeInfo xensysdev_info = {

Then I had a different instance of the failed assert trying to attach
xen-console-0 to xen-sysbus.  So I made this change:
--- a/hw/xen/xen-legacy-backend.c
+++ b/hw/xen/xen-legacy-backend.c
@@ -789,6 +789,7 @@ static void xendev_class_init(ObjectClass *klass,
void *data)
 set_bit(DEVICE_CATEGORY_MISC, dc->categories);
 /* xen-backend devices can be plugged/unplugged dynamically */
 dc->user_creatable = true;
+dc->bus_type = TYPE_XENSYSBUS;
 }

 static const TypeInfo xendev_type_info = {

Then it gets farther... until
qemu-system-i386: hw/core/qdev.c:439: qdev_assert_realized_properly:
Assertion `dev->realized' failed.

dev->id is NULL. The failing device is:
(gdb) p *dev.parent_obj.class.type
$12 = {name = 0x56207770 "cfi.pflash01",

Is that right?

I'm going to have to take a break from this now.

Regards,
Jason



[linux-5.4 test] 151288: regressions - FAIL

2020-06-22 Thread osstest service owner
flight 151288 linux-5.4 real [real]
http://logs.test-lab.xenproject.org/osstest/logs/151288/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-amd64-xl-credit2  16 guest-localmigrate   fail REGR. vs. 151232
 test-amd64-i386-libvirt-xsm  19 guest-start.2fail REGR. vs. 151232

Tests which did not succeed, but are not blocking:
 test-amd64-i386-xl-pvshim12 guest-start  fail   never pass
 test-arm64-arm64-xl-seattle  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  14 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt  13 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  13 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 13 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-xsm 13 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check 
fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check 
fail never pass
 test-armhf-armhf-xl-arndale  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  14 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  13 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  14 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 12 migrate-support-checkfail   never pass
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop fail never pass
 test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop  fail never pass
 test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop  fail never pass
 test-amd64-amd64-xl-qemut-win7-amd64 17 guest-stop fail never pass
 test-armhf-armhf-xl-multivcpu 13 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 14 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  13 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  14 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 13 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 14 saverestore-support-checkfail never pass
 test-armhf-armhf-libvirt 13 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt 14 saverestore-support-checkfail   never pass
 test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop fail never pass
 test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop fail never pass
 test-arm64-arm64-libvirt-xsm 13 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 14 saverestore-support-checkfail   never pass
 test-amd64-i386-xl-qemut-ws16-amd64 17 guest-stop  fail never pass
 test-armhf-armhf-xl-vhd  12 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  13 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 12 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 13 saverestore-support-checkfail   never pass
 test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop  fail never pass
 test-amd64-amd64-qemuu-nested-amd 17 debian-hvm-install/l1/l2  fail never pass

version targeted for testing:
 linux                67cb016870e2fa9ffc8d34cf20db5331e6f2cf4d
baseline version:
 linux                fd8cd8ac940c8b45b75474415291a3b941c865ab

Last test of basis   151232  2020-06-19 05:33:37 Z3 days
Testing same since   151288  2020-06-22 07:40:16 Z0 days1 attempts


People who touched revisions under test:
  Aaron Brown 
  Adrian Hunter 
  Al Viro 
  Alan Maguire 
  Alex De

Re: UEFI support in ARM DomUs

2020-06-22 Thread Oleksandr Andrushchenko

On 6/23/20 4:20 AM, Stefano Stabellini wrote:
> On Mon, 22 Jun 2020, Julien Grall wrote:
>>>>> For the first part (__XEN_INTERFACE_VERSION__) I think we can provide it
>>>>> via
>>>>>
>>>>> CFLAGS or something. This can also be done for the location of Xen
>>>>> headers.
>>>> __XEN_INTERFACE_VERSION__ should work through the CFLAGS. An alternative
>>>> would be to allow the user to specify through the Kconfig.
>>> You mean specifying via Kconfig something like "0x00040d00"?
>> Possibly yes.
>>
>>> And what about the headers? How will we provide their location if we decide
>>> not to include those
>>>
>>> in the tree?
>> I would do through Kconfig as well.
> If we specify the external location of the Xen headers via Kconfig, it
> seems to me that we should be able to detect the interface version
> automatically from any Makefile as part of the build. No need to ask the
> user.
>
> However, if Oleksandr is thinking of using the Xen headers for the
> hypercalls definitions, then I think we might not need external headers
> at all because that is a stable interface, as Julien wrote. We could
> just define our own few headers for just what you need like Linux does.

This is a good idea: I'll try to get the minimal set of headers from Linux
instead of Xen as those seem to be well prepared for such a use-case. This
way we'll have headers in U-boot tree and guarantee that we have a minimal
subset which is easier to maintain. I'll keep you updated on the progress

>
> If you can do that, I think it would be better because we decouple the
> UBoot build from the Xen build completely. We don't even need the Xen
> tree checked out to build UBoot. It would be a huge advantage because it
> makes it far easier to build-test changes for others in the community
> that don't know about Xen and also it becomes far easier to integrate
> into any build system.