On Mon, Mar 15, 2021 at 10:38 PM Ben Dooks <ben.do...@codethink.co.uk> wrote: > > On 13/03/2021 07:20, Dmitry Vyukov wrote: > > On Fri, Mar 12, 2021 at 9:12 PM Ben Dooks <ben.do...@codethink.co.uk> wrote: > >> > >> On 12/03/2021 16:25, Alex Ghiti wrote: > >>> > >>> > >>> Le 3/12/21 à 10:12 AM, Dmitry Vyukov a écrit : > >>>> On Fri, Mar 12, 2021 at 2:50 PM Ben Dooks <ben.do...@codethink.co.uk> > >>>> wrote: > >>>>> > >>>>> On 10/03/2021 17:16, Dmitry Vyukov wrote: > >>>>>> On Wed, Mar 10, 2021 at 5:46 PM syzbot > >>>>>> <syzbot+e74b94fe601ab9552...@syzkaller.appspotmail.com> wrote: > >>>>>>> > >>>>>>> Hello, > >>>>>>> > >>>>>>> syzbot found the following issue on: > >>>>>>> > >>>>>>> HEAD commit: 0d7588ab riscv: process: Fix no prototype for > >>>>>>> arch_dup_tas.. > >>>>>>> git tree: > >>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git fixes > >>>>>>> console output: > >>>>>>> https://syzkaller.appspot.com/x/log.txt?x=1212c6e6d00000 > >>>>>>> kernel config: > >>>>>>> https://syzkaller.appspot.com/x/.config?x=e3c595255fb2d136 > >>>>>>> dashboard link: > >>>>>>> https://syzkaller.appspot.com/bug?extid=e74b94fe601ab9552d69 > >>>>>>> userspace arch: riscv64 > >>>>>>> > >>>>>>> Unfortunately, I don't have any reproducer for this issue yet. > >>>>>>> > >>>>>>> IMPORTANT: if you fix the issue, please add the following tag to > >>>>>>> the commit: > >>>>>>> Reported-by: syzbot+e74b94fe601ab9552...@syzkaller.appspotmail.com > >>>>>> > >>>>>> +riscv maintainers > >>>>>> > >>>>>> This is riscv64-specific. > >>>>>> I've seen similar crashes in put_user in other places. It looks like > >>>>>> put_user crashes in the user address is not mapped/protected (?). > >>>>> > >>>>> I've been having a look, and this seems to be down to access of the > >>>>> tsk->set_child_tid variable. I assume the fuzzing here is to pass a > >>>>> bad address to clone? > >>>>> > >>>>> From looking at the code, the put_user() code should have set the > >>>>> relevant SR_SUM bit (the value for this, which is 1<<18 is in the > >>>>> s2 register in the crash report) and from looking at the compiler > >>>>> output from my gcc-10, the code looks to be dong the relevant csrs > >>>>> and then csrc around the put_user > >>>>> > >>>>> So currently I do not understand how the above could have happened > >>>>> over than something re-tried the code seqeunce and ended up retrying > >>>>> the faulting instruction without the SR_SUM bit set. > >>>> > >>>> I would maybe blame qemu for randomly resetting SR_SUM, but it's > >>>> strange that 99% of these crashes are in schedule_tail. If it would be > >>>> qemu, then they would be more evenly distributed... > >>>> > >>>> Another observation: looking at a dozen of crash logs, in none of > >>>> these cases fuzzer was actually trying to fuzz clone with some insane > >>>> arguments. So it looks like completely normal clone's (e..g coming > >>>> from pthread_create) result in this crash. > >>>> > >>>> I also wonder why there is ret_from_exception, is it normal? I see > >>>> handle_exception disables SR_SUM: > >>> > >>> csrrc does the right thing: it cleans SR_SUM bit in status but saves the > >>> previous value that will get correctly restored. > >>> > >>> ("The CSRRC (Atomic Read and Clear Bits in CSR) instruction reads the > >>> value of the CSR, zero-extends the value to XLEN bits, and writes it to > >>> integer registerrd. The initial value in integerregisterrs1is treated > >>> as a bit mask that specifies bit positions to be cleared in the CSR. Any > >>> bitthat is high inrs1will cause the corresponding bit to be cleared in > >>> the CSR, if that CSR bit iswritable. Other bits in the CSR are > >>> unaffected.") > >> > >> I think there may also be an understanding issue on what the SR_SUM > >> bit does. I thought if it is set, M->U accesses would fault, which is > >> why it gets set early on. But from reading the uaccess code it looks > >> like the uaccess code sets it on entry and then clears on exit. > >> > >> I am very confused. Is there a master reference for rv64? > >> > >> https://people.eecs.berkeley.edu/~krste/papers/riscv-privileged-v1.9.pdf > >> seems to state PUM is the SR_SUM bit, and that (if set) disabled > >> > >> Quote: > >> The PUM (Protect User Memory) bit modifies the privilege with which > >> S-mode loads, stores, and instruction fetches access virtual memory. > >> When PUM=0, translation and protection behave as normal. When PUM=1, > >> S-mode memory accesses to pages that are accessible by U-mode (U=1 in > >> Figure 4.19) will fault. PUM has no effect when executing in U-mode > >> > >> > >>>> https://elixir.bootlin.com/linux/v5.12-rc2/source/arch/riscv/kernel/entry.S#L73 > >>>> > >>> > >>> Still no luck for the moment, can't reproduce it locally, my test is > >>> maybe not that good (I created threads all day long in order to trigger > >>> the put_user of schedule_tail). > >> > >> It may of course depend on memory and other stuff. I did try to see if > >> it was possible to clone() with the child_tid address being a valid but > >> not mapped page... > >> > >>> Given that the path you mention works most of the time, and that the > >>> status register in the stack trace shows the SUM bit is not set whereas > >>> it is set in put_user, I'm leaning toward some race condition (maybe an > >>> interrupt that arrives at the "wrong" time) or a qemu issue as you > >>> mentioned. > >> > >> I suppose this is possible. From what I read it should get to the > >> point of being there with the SUM flag cleared, so either something > >> went wrong in trying to fix the instruction up or there's some other > >> error we're missing. > >> > >>> To eliminate qemu issues, do you have access to some HW ? Or to > >>> different qemu versions ? > >> > >> I do have access to a Microchip Polarfire board. I just need the > >> instructions on how to setup the test-code to make it work on the > >> hardware. > > > > For full syzkaller support, it would need to know how to reboot these > > boards and get access to the console. > > syzkaller has a stop-gap VM backend which just uses ssh to a physical > > machine and expects the kernel to reboot on its own after any crashes. > > > > But I actually managed to reproduce it in an even simpler setup. > > Assuming you have Go 1.15 and riscv64 cross-compiler gcc installed > > > > $ go get -u -d github.com/google/syzkaller/... > > $ cd $GOPATH/src/github.com/google/syzkaller > > $ make stress executor TARGETARCH=riscv64 > > $ scp bin/linux_riscv64/syz-execprog bin/linux_riscv64/syz-executor > > your_machine:/ > > > > Then run ./syz-stress on the machine. > > On the first run it crashed it with some other bug, on the second run > > I got the crash in schedule_tail. > > With qemu tcg I also added -slowdown=10 flag to syz-stress to scale > > all timeouts, if native execution is faster, then you don't need it. > > Ok, not sure what's going on. I get a lot of errors similar to: > > > > 2021/03/15 21:35:20 transitively unsupported: ioctl$SNAPSHOT_CREATE_IMAGE: > > no syscalls can create resource fd_snapshot, enable some syscalls that can > > create it [openat$snapshot]
This is not an error, just a notification that some syscalls are not enabled in the kernel and won't be fuzzed. > Followed by: > > > 2021/03/15 21:35:48 executed 0 programs > > 2021/03/15 21:35:48 failed to create execution environment: failed to mmap > > shm file: invalid argument > > The qemu is 5.2.0 and root is Debian/unstable riscv64 (same as chroot > used to build the syz tools) This is an error. But I see it the first time ever. It comes from here: https://github.com/google/syzkaller/blob/fdb2bb2c23ee709880407f56307e2800ad27e9ae/pkg/osutil/osutil_unix.go#L119-L121 There should be pretty simple logic inside of syscall.Mmap. Perhaps you are using some older Go toolchain with incomplete riscv support? I think I've used 1.14 and 1.15. But there is already 1.16. You can always download a toolchain here: https://golang.org/dl/