On 19.02.2024 16:20, Oleksii wrote:
> On Mon, 2024-02-19 at 15:12 +0100, Jan Beulich wrote:
>> On 19.02.2024 15:00, Oleksii wrote:
>>> On Sun, 2024-02-18 at 19:00 +0000, Julien Grall wrote:
>>>> On 05/02/2024 15:32, Oleksii Kurochko wrote:
>>>>> --- /dev/null
>>>>> +++ b/xen/arch/riscv/include/asm/cmpxchg.h
>>>>> @@ -0,0 +1,237 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>>>> +/* Copyright (C) 2014 Regents of the University of California */
>>>>> +
>>>>> +#ifndef _ASM_RISCV_CMPXCHG_H
>>>>> +#define _ASM_RISCV_CMPXCHG_H
>>>>> +
>>>>> +#include <xen/compiler.h>
>>>>> +#include <xen/lib.h>
>>>>> +
>>>>> +#include <asm/fence.h>
>>>>> +#include <asm/io.h>
>>>>> +#include <asm/system.h>
>>>>> +
>>>>> +#define ALIGN_DOWN(addr, size)  ((addr) & (~((size) - 1)))
>>>>> +
>>>>> +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, acquire_barrier) \
>>>>> +({ \
>>>>> +    asm volatile( \
>>>>> +        release_barrier \
>>>>> +        " amoswap" sfx " %0, %2, %1\n" \
>>>>> +        acquire_barrier \
>>>>> +        : "=r" (ret), "+A" (*ptr) \
>>>>> +        : "r" (new) \
>>>>> +        : "memory" ); \
>>>>> +})
>>>>> +
>>>>> +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, acquire_barrier) \
>>>>> +({ \
>>>>> +    uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned long)ptr, 4); \
>>>>> +    uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) * BITS_PER_BYTE; \
>>>>> +    uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \
>>>>> +    uint8_t mask_h = mask_l + mask_size - 1; \
>>>>> +    unsigned long mask = GENMASK(mask_h, mask_l); \
>>>>> +    unsigned long new_ = (unsigned long)(new) << mask_l; \
>>>>> +    unsigned long ret_; \
>>>>> +    unsigned long rc; \
>>>>> +    \
>>>>> +    asm volatile( \
>>>>> +        release_barrier \
>>>>> +        "0: lr.d %0, %2\n" \
>>>>
>>>> I was going to ask why this is lr.d rather than lr.w. But I see
>>>> Jan already asked. I agree with him that it should probably be
>>>> lr.w and ...
>>>>
>>>>> +        "   and  %1, %0, %z4\n" \
>>>>> +        "   or   %1, %1, %z3\n" \
>>>>> +        "   sc.d %1, %1, %2\n" \
>>>>
>>>> ... respectively sc.w. The same applies for cmpxchg.
>>>
>>> I agree that it would be better; my initial attempt was to handle a
>>> 2-byte access that crosses a 4-byte (or 8-byte) boundary:
>>>
>>> 0 1 2 3 4 5 6 7 8
>>> X X X 1 1 X X X X
>>>
>>> In this case, if I align address 3 down to address 0 and then read 4
>>> bytes instead of 8, the byte at address 4 will not be processed. This
>>> is why I started to read 8 bytes.
>>>
>>> I also acknowledge that there could be an issue in the following
>>> case:
>>>
>>> X  4094 4095 4096
>>> X    1   1    X
>>>
>>> In this situation, when I read 8 bytes, the new page (which starts
>>> at 4096) may not be mapped. It seems correct in this case to check
>>> that the variable lies within one page and to read 4 bytes instead
>>> of 8.
>>>
>>> One more thing I am uncertain about: if we change everything to
>>> read 4 bytes with 4-byte alignment, what should be done in the
>>> first case? Should we panic? (I am not sure that is an option.)
>>
>> Counter question (raised elsewhere already): What if a 4-byte access
>> crosses a word / cache line / page boundary? Ideally exactly the
>> same would happen for a 2-byte access crossing a respective boundary.
>> (Which you can achieve relatively easily by masking off only address
>> bit 1, keeping address bit 0 unaltered.)
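>>
>> E.g., purely as illustration:
>>
>>     addr & ~3UL    /* 1-byte access: select the containing word */
>>     addr & ~2UL    /* 2-byte access: bit 0 left unaltered */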
> But if we align down to a 4-byte boundary and then access 4 bytes, we
> can't cross a boundary. I agree that the algorithm is not correct, as
> it can ignore some values in certain situations. For example:
> 
> 0 1 2 3 4 5 6 7 8
> X X X 1 1 X X X X
> 
> In this case, the value at address 4 won't be updated.
> 
> I agree that introducing a new macro to check whether a variable
> crosses a boundary is necessary; alternatively, we could check that
> the address is 2-byte aligned:
> 
> #define CHECK_BOUNDARY_CROSSING(start, end, boundary_size) \
>     ASSERT((start / boundary_size) == (end / boundary_size))
> 
> Then, it is necessary to check:
> 
> CHECK_BOUNDARY_CROSSING(start, end, 4)
> CHECK_BOUNDARY_CROSSING(start, end, PAGE_SIZE)
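> 
> where start and end would be the first and last byte addresses of the
> accessed variable, e.g. (a sketch, names of my own choosing):
> 
>     unsigned long start = (unsigned long)ptr;
>     unsigned long end = start + sizeof(*ptr) - 1;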
> 
> But why do we need to check the cache line boundary? For the cache,
> it is only a question of performance; the access should still work,
> shouldn't it?

You don't need to check for any of these boundaries. You can simply
leverage what the hardware does for misaligned accesses. See the
various other replies I've sent - I thought things should have become
pretty much crystal clear by now: For 1-byte accesses you access the
containing word, by clearing the low two bits. For 2-byte accesses you
also access the containing word, by clearing only bit 1 (which then
naturally leaves no bit that needs clearing for the envisaged [but not
necessary] case of handling a 4-byte access). If the resulting 4-byte
access is still misaligned, it'll fault just as a non-emulated 4-byte
access would. And you don't need to care about any of the boundaries:
not at words, not at cache lines, and not at pages.
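
To make that concrete, a minimal (untested) sketch of the 2-byte case;
barriers are omitted, and all names (aligned, shift, mask, new_, old,
rc) are illustrative rather than taken from the patch:

    /*
     * Clear only bit 1 to get the word containing the 2-byte field.
     * A misaligned 2-byte access (bit 0 set) then yields a misaligned
     * 4-byte access, which faults exactly like a plain lr.w/sc.w.
     */
    uint32_t *aligned = (uint32_t *)((unsigned long)ptr & ~2UL);
    unsigned int shift = ((unsigned long)ptr & 3) * BITS_PER_BYTE;
    uint32_t mask = 0xffffU << shift;   /* bits of the 2-byte field */
    uint32_t new_ = (uint32_t)new << shift;
    uint32_t old, rc;

    asm volatile (
        "0: lr.w %0, %2\n"              /* load the containing word */
        "   and  %1, %0, %z4\n"         /* keep the other two bytes */
        "   or   %1, %1, %z3\n"         /* insert the new value */
        "   sc.w %1, %1, %2\n"          /* try to store it back */
        "   bnez %1, 0b\n"              /* retry if sc.w failed */
        : "=&r" (old), "=&r" (rc), "+A" (*aligned)
        : "rJ" (new_), "rJ" (~mask)
        : "memory" );

    ret = (old & mask) >> shift;        /* previous 2-byte value */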

Jan
