On Tue, Mar 27, 2018 at 3:52 AM, Richard Henderson
<richard.hender...@linaro.org> wrote:
> On 03/25/2018 05:24 AM, Michael Clark wrote:
> > Running with `-d in_asm,op,op_opt,out_asm` is very helpful for
> > debugging. Note: due to a limitation in QEMU, the backend
> > disassembler is not compiled unless the backend matches the
> > front-end, so `scripts/disas-objdump.pl` is required to decode the
> > emitted RISC-V assembly when using the x86_64 front-end.
>
> Certainly not.  The configure mistake, I think, is
>
> -  riscv)
> +  riscv*)
>      disas_config "RISCV"
>
> because for host $ARCH is going to be riscv64 not riscv.

Oh, my mistake. Thanks for pointing this out. I'll fix this in v2.

> > +int cpu_signal_handler(int host_signum, void *pinfo,
> > +                       void *puc)
> > +{
> > +    siginfo_t *info = pinfo;
> > +    ucontext_t *uc = puc;
> > +    greg_t pc = uc->uc_mcontext.__gregs[REG_PC];
> > +    int is_write = 0;
>
> You're going to have to fill this in for many guests to work.  A data
> write to the same page for which we have executed code will fire here.
>
> If your host kernel does not supply the proper info via ucontext_t or
> siginfo_t (highly recommended, assuming the hardware reports this as
> part of the fault), then you'll need to do something as brute force as
> reading from the host PC and disassembling to see if it was a host
> store insn.

Apparently we don't have this in our ucontext, and changing it would
require an ABI change. It seems siginfo_t only contains si_addr. We do
have space reserved in ucontext; if we were to add it there, we could
use 0 for unknown. It seems we'll need to use the host PC and
disassemble the instruction.

> I believe you can see this with e.g. sparc from our
> linux-user-test-0.3.tgz on the qemu wiki.
>
> > +/* optional instructions */
> > +#define TCG_TARGET_HAS_goto_ptr         1
> > +#define TCG_TARGET_HAS_movcond_i32      0
>
> Future: Does your real hardware do what the arch manual describes and
> predicate a jump across a single register move instruction?  Either
> way, for output code density you may wish to implement
>
>     movcond_i32  out,x,y,in,out,cc
> as
>     bcc  x, y, .+8
>     mov  out, in
>
> rather than allow the tcg middle-end to expand to a 5 insn sequence.
> See e.g. i386, ppc, s390 where we do exactly this when the hardware
> does not support a real conditional move insn.

Okay, I'll implement movcond as a bcc .+8 and mv.

> > +    if ((ct & TCG_CT_CONST_N12) && val >= -2047 && val <= 2047) {
>
> +2048?

We use this constraint for a negatable immediate, and the constraint is
only applied to sub. We have no subi, so we implement subi as
addi rd, rs1, -imm:

    case INDEX_op_sub_i32:
        if (c2) {
            tcg_out_opc_imm(s, is32bit ? OPC_ADDI : OPC_ADDIW,
                            a0, a1, -a2);
        } else {
            tcg_out_opc_reg(s, is32bit ? OPC_SUB : OPC_SUBW, a0, a1, a2);
        }
        break;

> > +/* Type-S */
> > +
> > +static int32_t encode_simm12(uint32_t imm)
> > +{
> > +    return ((imm << 20) >> 25) << 25 | ((imm << 27) >> 27) << 7;
>
> Probably more legible as
>
>     extract32(imm, 0, 5) << 7 | extract32(imm, 5, 7) << 25

I can change these to extract32. I actually wrote code to generate
these from instruction set metadata so that I could avoid manual
transcription errors.

> > +/* Type-SB */
> > +
> > +static int32_t encode_sbimm12(uint32_t imm)
> > +{
> > +    return ((imm << 19) >> 31) << 31 | ((imm << 21) >> 26) << 25 |
> > +           ((imm << 27) >> 28) << 8 | ((imm << 20) >> 31) << 7;
> > +}
>
> Similarly.
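Here's what I expect the extract32 versions to look like, derived from
the shift arithmetic above (a sketch; I'll regenerate them from the
instruction set metadata to double-check the bit positions):

    /* Type-S: imm[11:5] -> insn[31:25], imm[4:0] -> insn[11:7] */
    static int32_t encode_simm12(uint32_t imm)
    {
        return extract32(imm, 5, 7) << 25 | extract32(imm, 0, 5) << 7;
    }

    /* Type-SB: imm[12] -> insn[31], imm[10:5] -> insn[30:25],
     * imm[4:1] -> insn[11:8], imm[11] -> insn[7] */
    static int32_t encode_sbimm12(uint32_t imm)
    {
        return extract32(imm, 12, 1) << 31 | extract32(imm, 5, 6) << 25 |
               extract32(imm, 1, 4) << 8 | extract32(imm, 11, 1) << 7;
    }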
> > +static void tcg_out_movi(TCGContext *s, TCGType type, TCGReg rd,
> > +                         tcg_target_long val)
> > +{
> > +    tcg_target_long lo = sextract64(val, 0, 12);
> > +    tcg_target_long hi = val - lo;
> > +
> > +    RISCVInsn add32_op = TCG_TARGET_REG_BITS == 64 ? OPC_ADDIW :
> > +                         OPC_ADDI;
> > +
> > +    if (val == lo) {
> > +        tcg_out_opc_imm(s, OPC_ADDI, rd, TCG_REG_ZERO, val);
> > +    } else if (val && !(val & (val - 1))) {
> > +        /* power of 2 */
> > +        tcg_out_opc_imm(s, OPC_ADDI, rd, TCG_REG_ZERO, 1);
> > +        tcg_out_opc_imm(s, OPC_SLLI, rd, rd, ctz64(val));
> > +    } else if (TCG_TARGET_REG_BITS == 64 &&
> > +               !(val >> 31 == 0 || val >> 31 == -1)) {
> > +        int shift = 12 + ctz64(hi >> 12);
> > +        hi >>= shift;
> > +        tcg_out_movi(s, type, rd, hi);
> > +        tcg_out_opc_imm(s, OPC_SLLI, rd, rd, shift);
> > +        if (lo != 0) {
> > +            tcg_out_opc_imm(s, OPC_ADDI, rd, rd, lo);
> > +        }
>
> Future: The other special case that happens frequently is loading of a
> 64-bit host address.  E.g. for exit_tb after goto_tb, the address of
> the TB itself.  You will want to test to see if auipc+addi can load
> the value before falling back to the full 64-bit constant load.

Good idea. I'll implement auipc+addi.

> Future: I'll note that your worst-case here is 8 insns.  Consider
> using the constant pool instead of really long sequences.

I was thinking about using the constant pool. I'm in two minds about
it, given the load-to-use latency vs icache bandwidth. It would need
some benchmarking.

> > +static void tcg_out_ldst(TCGContext *s, RISCVInsn opc, TCGReg data,
> > +                         TCGReg addr, intptr_t offset)
> > +{
> > +    int32_t imm12 = sextract32(offset, 0, 12);
> > +    if (offset != imm12) {
> > +        if (addr == TCG_REG_ZERO) {
> > +            addr = TCG_REG_TMP0;
> > +        }
> > +        tcg_out_movi(s, TCG_TYPE_PTR, addr, offset - imm12);
> > +    }
>
> This isn't right.  You need to add offset to the original ADDR, not
> overwrite it.  Something like
>
>     tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP0, offset - imm12);
>     if (addr != TCG_REG_ZERO) {
>         tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP0, TCG_REG_TMP0, addr);
>     }
>     addr = TCG_REG_TMP0;

Thanks. This probably explains the bugs I am seeing.

> > +static void tcg_out_jump_internal(TCGContext *s, tcg_insn_unit *arg,
> > +                                  bool tail)
> > +{
> > +    TCGReg link = tail ? TCG_REG_ZERO : TCG_REG_RA;
> > +    ptrdiff_t offset = tcg_pcrel_diff(s, arg);
> > +    if (offset == sextract64(offset, 1, 12)) {

Also, these tests need to shift the extract 1 bit left, so it was
emitting the far jump.

> > +        /* short jump: -4094 to 4096 */
> > +        tcg_out_opc_jump(s, OPC_JAL, link, offset);
>
> Err... the direct JAL encodes a 21-bit constant.  What's the 4k test
> for?

Brain fade.

> > +    } else if (offset == sextract64(offset, 1, 31)) {

should be:

    } else if (offset == sextract64(offset, 1, 31) << 1) {

> > +        /* long jump: -2147483646 to 2147483648 */
> > +        tcg_out_opc_upper(s, OPC_AUIPC, TCG_REG_TMP0, 0);
> > +        tcg_out_opc_imm(s, OPC_JALR, link, TCG_REG_TMP0, 0);
> > +        reloc_call(s->code_ptr - 2, arg);
>
> Check for riscv32 here, to avoid the real compare and elide the 64-bit
> case?

Will do.

> > +    } else {
> > +        /* far jump: 64-bit */
> > +        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP0,
> > +                     (tcg_target_long)arg);
> > +        tcg_out_opc_imm(s, OPC_JALR, link, TCG_REG_TMP0, 0);
>
> Fold the final 12 bits into the JALR?

Good idea.
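For the far-jump arm, I'm thinking of splitting the constant the same
way tcg_out_movi does, so the low 12 bits land in the JALR immediate
(an untested sketch):

    } else {
        /* far jump: load the high part, fold the low 12 bits into
           the JALR immediate */
        tcg_target_long imm = (tcg_target_long)arg;
        tcg_target_long lo = sextract64(imm, 0, 12);
        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP0, imm - lo);
        tcg_out_opc_imm(s, OPC_JALR, link, TCG_REG_TMP0, lo);
    }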
> > +static void tcg_out_mb(TCGContext *s, TCGArg a0)
> > +{
> > +    static const RISCVInsn fence[] = {
> > +        [0 ... TCG_MO_ALL] = OPC_FENCE_RW_RW,
> > +        [TCG_MO_LD_LD] = OPC_FENCE_R_R,
> > +        [TCG_MO_ST_LD] = OPC_FENCE_W_R,
> > +        [TCG_MO_LD_ST] = OPC_FENCE_R_W,
> > +        [TCG_MO_ST_ST] = OPC_FENCE_W_W,
> > +        [TCG_BAR_LDAQ] = OPC_FENCE_RW_R,
> > +        [TCG_BAR_STRL] = OPC_FENCE_W_RW,
> > +        [TCG_BAR_SC] = OPC_FENCE_RW_RW,
> > +    };
> > +    tcg_out32(s, fence[a0 & TCG_MO_ALL]);
>
> This is wrong.  In particular, TCG_BAR_* is irrelevant to OPC_FENCE.
> More, TCG_MO_* are bit combinations.  A good mapping might be
>
>     uint32_t insn = OPC_FENCE;
>     if (a0 & TCG_MO_LD_LD) {
>         insn |= (1 << 25) | (1 << 21);   /* PR | SR */
>     }
>     if (a0 & TCG_MO_ST_LD) {
>         insn |= (1 << 24) | (1 << 21);   /* PW | SR */
>     }
>     if (a0 & TCG_MO_LD_ST) {
>         insn |= (1 << 25) | (1 << 20);   /* PR | SW */
>     }
>     if (a0 & TCG_MO_ST_ST) {
>         insn |= (1 << 24) | (1 << 20);   /* PW | SW */
>     }
>
> You could fold this into a table, but it's moderately clear like this.

Okay, thanks. I'll look at the Linux kernel barrier implementation.
There has been some discussion on the Linux kernel mailing list about
barriers...

> > +    case MO_Q:
> > +        /* Prefer to load from offset 0 first, but allow for
> > +           overlap. */
> > +        if (TCG_TARGET_REG_BITS == 64) {
> > +            tcg_out_opc_imm(s, OPC_LD, lo, base, 0);
> > +        } else {
> > +            tcg_out_opc_imm(s, OPC_LW, lo, base, 0);
> > +            tcg_out_opc_imm(s, OPC_LW, hi, base, 4);
>
> Without extra constraints, you have to care for LO (or HI) overlapping
> BASE.
>
> > +    case INDEX_op_goto_tb:
> > +        if (s->tb_jmp_insn_offset) {
> > +            /* direct jump method */
> > +            s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
> > +            /* should align on 64-bit boundary for atomic patching */
> > +            tcg_out_opc_upper(s, OPC_AUIPC, TCG_REG_TMP0, 0);
> > +            tcg_out_opc_imm(s, OPC_JALR, TCG_REG_ZERO, TCG_REG_TMP0,
> > +                            0);
>
> You're not actually using this path yet, right?
> Probably better to remove it for now until all of the other pieces are
> present.

I'm not sure. I'll instrument it. I remember seeing reloc_call being
called, but I'm not sure how these are generated. I'll read the TCG
code...

> > +    case INDEX_op_br:
> > +        tcg_out_reloc(s, s->code_ptr, R_RISCV_CALL, arg_label(a0), 0);
> > +        tcg_out_opc_upper(s, OPC_AUIPC, TCG_REG_TMP0, 0);
> > +        tcg_out_opc_imm(s, OPC_JALR, TCG_REG_ZERO, TCG_REG_TMP0, 0);
>
> You should be able to just use JAL here.  1MB range should be smaller
> than any one TB.  There is never a BR opcode between different TB;
> that's the GOTO_TB opcode.

Okay, great.

Thanks for the detailed feedback...
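P.S. For the br change, something along these lines (a sketch, assuming
a new R_RISCV_JAL reloc kind wired into patch_reloc next to
R_RISCV_CALL):

    case INDEX_op_br:
        tcg_out_reloc(s, s->code_ptr, R_RISCV_JAL, arg_label(a0), 0);
        /* JAL with rd = zero is a plain jump; the label offset is
           patched in via the relocation */
        tcg_out_opc_jump(s, OPC_JAL, TCG_REG_ZERO, 0);
        break;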